
Master's Thesis

A proper evaluation method for daily demand density forecasts

Joran Roor

Student number: 10218882

Date of final version: July 13, 2017

Master’s programme: Econometrics

Specialisation: Big Data and Business Analytics

Supervisor: Prof. dr. C. Diks

Second reader: Dr. K. J. van Garderen


Preface

Studying this topic widened my perspective about the practical applicability of the theory and techniques I have been learning over the past years. What started as a collection of highly theoretical courses, now comes together in a practical manner.

I would like to thank my supervisor Professor Cees Diks for our inspiring meetings, after which I always knew exactly how to continue my research. The amount of knowledge he has on such a variety of topics has always amazed me. I could not have imagined a better supervisor. Also, I would like to thank co-founder of Optiply Sander van den Broek for all the help and advice during the process. Furthermore, I would like to express my gratitude to second reader Dr. Kees Jan van Garderen for reading and criticising my paper. Lastly, I would like to thank my study mates for the feedback they gave on my work.


Statement of Originality

This document is written by Joran Roor, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Contents

1 Introduction
2 Theoretical background
   2.1 Scoring rules for density forecast evaluation
   2.2 KLIC-based Evaluation Method
   2.3 The 'forecast' package
3 Method
   3.1 Data description
   3.2 Demand distribution assumption
   3.3 Size estimation
   3.4 Power estimation
   3.5 Applied density forecast evaluation
4 Estimation results
   4.1 Size and power estimation
   4.2 Comparing predictive ability
5 Summary and Discussion


Abstract

This paper tests the properness of the KLIC-based density forecast evaluation method of Mitchell and Hall (2005) when it is applied to heteroscedastic and autocorrelated demand data. The method appears to be proper under the condition that the demand realisations of the product of interest contain at least one positive sale. Furthermore, the test is more powerful when realisations within product groups are combined for the test. Applying the evaluation method to all products combined, the auto.arima forecasting method of Hyndman and Khandakar (2008) appears to perform significantly better than nnetar and ets.


1 Introduction

Many companies these days are interested in forecasting a variety of variables. This can help them make important decisions affecting the optimization of their processes and the maximization of their profit. It is common that economists, for example, search for a forecasting model that best describes reality. However, it is often forgotten that in most cases forecasting performance, rather than model fit, is the most important feature. Typically, when the dataset is small, simple models with a small number of parameters can outperform complex models (describing reality) in terms of forecasting ability.

Density forecasting has the advantage over point forecasting that it gives a full distribution of the variable of interest, instead of just the mean or mode. This can be of good use in cases where an exact estimate is not needed but an upper or lower bound is of interest. For example, in inventory management for shops, it is of interest to know how many products should at least be in stock to satisfy (almost) all demand without incurring too much inventory cost. Furthermore, when information in the tails is important, for example when demand is close to zero, density forecasts describe how a variable behaves in this area.

Setting up a webshop these days seems to be within reach for everyone. At the beginning of 2017, the Netherlands counted over 30,000 registered webshops. When a webshop grows, more inventory is needed to match the demand and external warehouses become necessary. Inventory management gets more complex and storage costs increase.

Optiply is a startup developing supply management software for webshops. This software advises when and how many products to order. Inventory management is very much an operations research process, in which many calculations are done in order to design the most profitable process. This requires an optimal trade-off between delivery assurance and inventory cost minimization. The optimization process requires knowledge about the future demand of specific products, which is gained by forecasting methods. Optiply uses multiple forecasting techniques to forecast the demand of tens of thousands of different products offered by multiple webshops. This raises the question of which forecasting method performs best and in which cases.

Optiply is still developing forecasting methods that use many external variables, such as weather data, Google Trends and Google Analytics. It combines machine learning techniques with classical statistical time series models to use the available information as efficiently as possible. Unfortunately, this new method is not ready to be evaluated yet. Therefore, this paper considers the (much simpler) current forecasting methods provided by Hyndman and Khandakar (2008): ets (Exponential Smoothing State Space Model), nnetar (Neural Network Time Series Forecasts) and auto.arima (which fits the best ARIMA model to a univariate time series).

This paper is set up as follows. In Chapter 2, some density forecast evaluation methods are discussed, with extra focus on the Kullback-Leibler Information Criterion (KLIC). In Chapter 3, the data are summarized and it is explained how the size and power of a KLIC-based evaluation method are estimated. Furthermore, it is described how this evaluation method can be applied in practice. Chapter 4 shows the results of the size and power estimation. Chapter 5 summarizes and suggests further research.


2 Theoretical background

2.1 Scoring rules for density forecast evaluation

A straightforward way to compare forecast performance is by comparing loss functions and checking which method results in the least loss, for example using the mean squared forecasting error (MSFE). Stock and Watson (2004) calculate the ratio of the MSFEs of two competing forecasts, also called the relative MSFE. The relative MSFE is equal to 1 under the null hypothesis. They state that it is unclear what distribution this statistic has under the null hypothesis, and that it is therefore impossible to statistically test a difference in forecasting performance based on this measure.

Hyndman and Koehler (2006) introduce the Mean Absolute Scaled Error (MASE), which scores forecasting methods based on how they perform compared to the naïve forecast. Assume $\{f_1, f_2, \ldots, f_n\}$ are $n$ forecasts of the ex post out-of-sample realisations $\{y_1, y_2, \ldots, y_n\}$. The scaled error is defined as

$$ q_t = \frac{y_t - f_t}{\frac{1}{n-1}\sum_{i=2}^{n} |y_i - y_{i-1}|}, $$

and the MASE is the mean of the absolute scaled errors $|q_t|$.

The MASE is the forecast evaluation method currently used by Optiply. However, the MASE too is unable to provide statistical evidence that the errors of two competing forecasting methods differ significantly from each other.
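To make the formula concrete, the following is a minimal R sketch of the scaled error $q_t$ and the resulting MASE; the inputs y (realisations) and f (forecasts) are hypothetical.

    # Scaled error q_t and the MASE as defined above; 'y' holds the
    # realisations y_1..y_n and 'f' the forecasts f_1..f_n (hypothetical inputs).
    scaled_error <- function(y, f) {
      scale <- mean(abs(diff(y)))  # (1/(n-1)) * sum_{i=2}^{n} |y_i - y_{i-1}|
      (y - f) / scale
    }
    mase <- function(y, f) mean(abs(scaled_error(y, f)))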

Diebold et al. (1998) derived a forecast evaluation method based on the probability integral transform (PIT). The PIT uses the fact that if a random variable X has a cumulative distribution function F(.), then F(X) has a Unif(0, 1) distribution. To evaluate the performance of a density forecast G(.), the uniformity of G(X) is tested. However, the PIT is not suitable for comparing competing density forecasts. Bao et al. (2004) use the KLIC in combination with the PIT to evaluate and compare density forecasts: if the density is correctly specified, a normal transformation of the uniformly distributed PIT results in a normally distributed variable. Bao et al. (2004) also use the censored likelihood to estimate the tail minimum KLIC.
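As an illustration of the PIT idea, the sketch below checks the uniformity of PIT values for a hypothetical Poisson density forecast in R. Since demand is discrete, a randomised PIT is used to obtain continuous values (the randomisation is an assumption beyond the text above); uniformity can then be checked with, for example, a Kolmogorov-Smirnov test.

    # Randomised PIT uniformity check for a hypothetical Poisson forecast:
    # u_t ~ Unif(G(y_t - 1), G(y_t)) is Unif(0,1) iff the forecast G is correct.
    set.seed(1)
    y  <- rpois(200, lambda = 3)   # hypothetical realisations
    mu <- rep(3, 200)              # hypothetical Poisson forecast means
    u  <- runif(200, ppois(y - 1, mu), ppois(y, mu))  # randomised PIT values
    ks.test(u, "punif")            # uniformity should not be rejected here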


2.2 KLIC-based Evaluation Method

Giacomini and White (2006) introduce a method to compare two competing density forecasting methods on their conditional predictive ability. They develop a test for the null hypothesis that the expected difference of the losses of two competing forecasts, given the available information set, is equal to zero. This is used to derive a Wald-type test statistic, whose asymptotic chi-squared distribution can be used to statistically detect a significant difference in conditional predictive ability.

Mitchell and Hall (2005) use an earlier version of the test of Giacomini and White (2006) to derive a forecast evaluation method based on the KLIC. Normally, the KLIC is a distance measure of a density forecast to the true density. Since in practice the true density is typically unknown, it is usually not possible to calculate this criterion. However, if the difference in performance between two density forecasts is of interest, the KLIC can be of use. It is defined as

$$ \mathrm{KLIC}_t = \int f_t(y_t) \ln\left(\frac{f_t(y_t)}{g_t(y_t)}\right) \mathrm{d}y_t = \mathrm{E}\left[\ln f_t(y_t) - \ln g_t(y_t)\right]. $$

As Mitchell and Hall (2005) show, a difference $d_t$ can be defined as

$$ d_t = \left[\ln f_t(y_t) - \ln g_{1t}(y_t)\right] - \left[\ln f_t(y_t) - \ln g_{2t}(y_t)\right] = \ln g_{2t}(y_t) - \ln g_{1t}(y_t), \tag{1} $$

such that

$$ \mathrm{E}[d_t] = \mathrm{KLIC}_{1t} - \mathrm{KLIC}_{2t}. $$

This means that the difference of the KLICs of two competing forecasts $g_{1t}(y_t)$ and $g_{2t}(y_t)$ can be measured by the expectation of the log-likelihood differences, the estimation of which does not involve the true density $f_t(\cdot)$. The null hypothesis of the KLIC-based method of Mitchell and Hall (2005) is given by

$$ H_0 : \mathrm{KLIC}_{1t} - \mathrm{KLIC}_{2t} = \mathrm{E}(d_t) = 0, $$

which is related to the likelihood ratio test statistic of Berkowitz (2001). To estimate $\mathrm{E}(d_t)$, the sample mean $\bar{d}$ is calculated over all $T$ realisations as

$$ \bar{d} = \frac{1}{T}\sum_{t=1}^{T} d_t. \tag{2} $$


Then, under the null hypothesis, the limiting distribution result

$$ \frac{\bar{d}}{\sqrt{S_d / T}} \xrightarrow{d} N(0, 1) \tag{3} $$

holds, where $S_d$ is a HAC robust (to serial correlation and heteroscedasticity) estimator of the long run variance of $d_t$.

In the case of demand data it is very important to use a HAC robust estimator, since demand data are typically serially correlated and heteroscedastic. The HAC robust estimator is calculated using the rules provided by Diks et al. (2011):

$$ S_d = \hat{\gamma}_0 + 2\sum_{k=1}^{K-1} a_k \hat{\gamma}_k, $$

where $\hat{\gamma}_k$ denotes the lag-$k$ sample covariance of the sequence $\{d_{t+1}\}_{t=m}^{T-1}$ and $a_k$ the Bartlett weights $a_k = 1 - \frac{k}{K}$ with $K = \lfloor n^{1/4} \rfloor$.

This makes it possible to statistically test whether the KLIC of a density forecast differs significantly from the KLIC of some competing density forecast, without knowing the true density. According to Mitchell and Hall (2005) and Giacomini and White (2006), the KLIC-based evaluation method permits some heteroscedasticity; however, it is not very clear how much is allowed in practice. Due to, for example, seasonality and time trends, demand data are assumed to be heteroscedastic. Therefore, it should be tested whether the evaluation method has good size and power properties when applied to demand data.
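As an illustration, the following is a minimal R sketch of the test statistic in (3) with the Bartlett-weight HAC estimator above; the vector d of log-likelihood differences is assumed to be given, and the function name is hypothetical.

    # Test statistic of Equation (3) with the Bartlett-weight HAC estimator.
    # 'd' is assumed to hold the log-likelihood differences d_t.
    klic_test_stat <- function(d) {
      n <- length(d)
      K <- floor(n^(1 / 4))
      # lag-k sample autocovariances for k = 0, ..., K - 1
      gam <- sapply(0:(K - 1), function(k) {
        mean((d[1:(n - k)] - mean(d)) * (d[(1 + k):n] - mean(d)))
      })
      a  <- 1 - seq_len(K - 1) / K            # Bartlett weights a_k = 1 - k/K
      Sd <- gam[1] + 2 * sum(a * gam[-1])     # long run variance estimate
      mean(d) / sqrt(Sd / n)                  # compare with N(0, 1)
    }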

The KLIC-based evaluation method is very useful for evaluating two density forecasts. Yet, in practice, the evaluation of more than two density forecasts is of more interest. Hansen et al. (2011) provide the model confidence set (MCS) procedure, which can handle multiple comparisons. First, all forecasts are ranked according to their performance on the data at hand (best performing first), and the null hypothesis of equal performance of all density forecasts is tested against the alternative that the lowest ranked forecast performs worse. Secondly, if this hypothesis is rejected, the worst performing density forecast is left out. These steps are repeated until the null hypothesis can no longer be rejected. The remaining density forecasts comprise the MCS, meaning that they all perform equally well at the given level of significance.

2.3 The 'forecast' package

Hyndman and Khandakar (2008) built an R package providing many tools to automatically fit forecasting models to time series. All models are based on univariate time series and thus do not take external variables into account. In their paper, Hyndman and Khandakar (2008) explain two important methods, namely ets and auto.arima.

ets stands for Exponential Smoothing State Space Model and fits univariate time series using a moving window. The difference between this technique and regular moving averages is that ets weights observations according to their age: it assigns exponentially decreasing weights to past observations. Multiple models are tested and chosen based on Akaike's Information Criterion (AIC). The method is very suitable for time series without seasonality or a time trend (Hyndman and Khandakar, 2008).

The auto.arima function fits the best ARIMA model to a univariate time series. This method also considers multiple models and selects the best using the AIC. One method that is included in the package, but not in Hyndman and Khandakar (2008), is nnetar. nnetar fits a neural network with multiple units in one hidden layer on multiple lagged values of the series.
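As an illustration, a minimal R sketch of how the three methods are typically called; the demand series is a hypothetical stand-in.

    # The three methods from the 'forecast' package applied to a univariate
    # demand series; the series itself is a hypothetical stand-in.
    library(forecast)

    demand <- ts(rpois(360, lambda = 2))   # hypothetical daily demand history

    fit_ets   <- ets(demand)          # exponential smoothing state space model
    fit_arima <- auto.arima(demand)   # best ARIMA model by information criterion
    fit_nnet  <- nnetar(demand)       # neural network on lagged values

    # 90-day-ahead forecasts, the evaluation horizon used in Section 3.5
    fc_ets   <- forecast(fit_ets, h = 90)
    fc_arima <- forecast(fit_arima, h = 90)
    fc_nnet  <- forecast(fit_nnet, h = 90)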


3 Method

3.1 Data description

The data provided by Optiply originate from a company specializing in managing webshops. The dataset consists of daily demand data of 1807 different products. The assortment keeps changing, which means that every product has a different date at which its selling started, resulting in a different amount of information available across products.

In most cases, the daily demand for a product in a webshop is a very difficult variable to measure. At Optiply, the demand for a product on a specific date is assumed to be equal to the number of sales of that product made on that day. However, this assumption does not hold on days when the product is out of stock, because no sales can be made on such a day while the demand could be positive. On those days, unregistered lost sales can occur and, therefore, the daily sales (all equal to zero) are no longer a good estimator of the daily demand. Hence, for the days on which a certain product is out of stock, Optiply estimates the demand by a moving average of historic and future demand.

3.2 Demand distribution assumption

Since, in Section 4, size and power calculations are made based on a simulation study, some density should be chosen to create a close-to-reality data generating process for the simulation study. However, the only information that is known is the observed demand data time series. It is assumed that the demand is non-stationary and heterogeneous, and therefore the density cannot be estimated by bootstrapping or non-parametric methods. Hence, some assumptions should be made to determine the data generating process properly.

Bain and Engelhardt (1992) explain what a Poisson process is and what conditions must hold for a process to be Poisson. Since demand is a discrete variable, purchases usually occur one at a time, and consumers are assumed to make independent decisions, the Poisson distribution comes out as a reasonable distribution assumption.

3.3 Size estimation

Since the demand density is assumed to follow a Poisson distribution, in the simulations the demand data are generated by a Poisson distribution with $\lambda$ equal to the realised demand. Demand data can then be generated by taking, for each time period, a random draw from the daily demand densities.

To estimate the size of the KLIC-based evaluation method, a situation must be created in which the null hypothesis holds. This means that two density forecasts are needed with "equal forecasting performance", such that they are expected to have the same average KLIC (the same average distance to the true density). In order to achieve this, first two density forecast series are created: one of these series always lies 20% above all true densities and the other 20% below. Secondly, two competing density forecasts can be created by, at each time period, randomly assigning with equal probability (1/2) one of the two generated densities to one of the two competing forecasts. This is clarified in Example 3.1.

Example 3.1. Assume that a realisation $y_t$ is a random draw from its true density, which is defined as

$$ f_t(y_t) = \mathrm{POI}(\lambda_t). $$

Now, two forecast series are generated using the true density:

$$ \rho^{(a)}_t = 1.2\,\lambda_t, \qquad \rho^{(b)}_t = 0.8\,\lambda_t. $$

Next, the parameters of the two competing density forecasts are determined as

$$ \mu^{(1)}_t = u_t\,\rho^{(a)}_t + (1 - u_t)\,\rho^{(b)}_t, \qquad \mu^{(2)}_t = u_t\,\rho^{(b)}_t + (1 - u_t)\,\rho^{(a)}_t, $$

with $u_t \sim \mathrm{Round}(\mathrm{Unif}(0, 1))$. The two competing density forecasts $g_{1t}(y_t)$ and $g_{2t}(y_t)$ are now defined as

$$ g_{1t}(y_t) = \mathrm{POI}(\mu^{(1)}_t), \qquad g_{2t}(y_t) = \mathrm{POI}(\mu^{(2)}_t). $$


This means that both KLICs of the competing forecasts are expected to converge to the same number, such that the expectation of (1) is equal to zero. In this case, the null hypothesis holds.
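To make this construction concrete, the following is a minimal R sketch of one null simulation for a single product, following Example 3.1; it reuses the klic_test_stat sketch from Section 2.2, and the demand path lambda is an illustrative assumption.

    # One null simulation following Example 3.1; 'klic_test_stat' is the
    # sketch from Section 2.2 and 'lambda' a hypothetical demand path.
    simulate_null_stat <- function(lambda) {
      y   <- rpois(length(lambda), lambda)      # realisations drawn from POI(lambda_t)
      u   <- rbinom(length(lambda), 1, 0.5)     # u_t ~ Round(Unif(0, 1))
      mu1 <- ifelse(u == 1, 1.2, 0.8) * lambda  # parameter of forecast g_1t
      mu2 <- ifelse(u == 1, 0.8, 1.2) * lambda  # parameter of forecast g_2t
      d   <- dpois(y, mu2, log = TRUE) - dpois(y, mu1, log = TRUE)  # d_t as in (1)
      klic_test_stat(d)
    }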

Under the null hypothesis, the test described in Section 2.2 is expected to reject $100\alpha\%$ of the time, given a nominal size $\alpha$. Each simulation then has a rejection probability equal to $\alpha$. This implies that the total number of rejections after $s$ simulations follows a binomial distribution,

$$ r \sim \mathrm{Bin}(s, \alpha), $$

where $r$ is the total number of rejections. Using this distribution, a 95% confidence interval can be constructed to check whether the estimated size is equal to the nominal size under the null hypothesis:

$$ \Pr\left( r \in \left\{ \mathrm{Bin}^{-1}(0.025;\, s, \alpha), \ldots, \mathrm{Bin}^{-1}(0.975;\, s, \alpha) \right\} \right) \approx 0.95, $$

with $\mathrm{Bin}^{-1}$ the inverse cumulative binomial distribution. If the number of rejections lies outside this interval, the testing criteria can be adjusted such that it lies inside the interval. In cases where the test is too conservative, the testing criteria are made less strict, which is in favour of the power. On the other hand, if the test rejects too often, the testing criteria are made more strict at the expense of the power.

After simulating 10,000 times, 10,000 test statistics are computed as in Equation (3). Sorting all these values and taking the $\mathrm{Round}((1 - \frac{\alpha}{2}) \cdot 10000)$-th and the $\mathrm{Round}(\frac{\alpha}{2} \cdot 10000)$-th elements results in the upper and lower bounds of the two-sided test, creating a size equal to the nominal size. Next, only for the products with a size outside the confidence interval, these new critical values replace the standard normal ones and are used in the power estimation of the next section. Note that after this step, for all products without a proper size in the first place, the standard normal distribution as in (3) is no longer used.
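A sketch of this size correction, under the same assumptions as the simulation sketch above:

    # Size correction: the empirical alpha/2 and (1 - alpha/2) quantiles of
    # 10,000 simulated null statistics replace the standard normal critical values.
    lambda <- pmax(0.5, 2 + sin(2 * pi * (1:360) / 36))  # hypothetical, heteroscedastic path
    stats  <- replicate(10000, simulate_null_stat(lambda))
    alpha  <- 0.05
    crit   <- quantile(stats, c(alpha / 2, 1 - alpha / 2))  # adjusted critical values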


3.4 Power estimation

To estimate the power of the KLIC-based forecast evaluation method, a situation must be created such that the null hypothesis does not hold. This is certainly the case when one of the two competing density forecasts is equal to the true density, while the other has some plausible distance to the true density. This plausible distance can be achieved by changing the parameter of the Poisson distribution. The difference in this parameter can be chosen in a number of ways, such as testing multiple values, or choosing 0.5 such that it mostly predicts the wrong integer. In this thesis, the difference in the parameter is set equal to the standard deviation of all demand realisations of the product, to obtain a difference that is likely to occur in practice. The construction of the two competing forecasts is clarified in Example 3.2.

Example 3.2. Assume that a realisation $y_t$ is a random draw from its true density, which is defined as

$$ f_t(y_t) = \mathrm{POI}(\lambda_t). $$

Now, two competing density forecasts are generated using the true density:

$$ g_{1t}(y_t) = \mathrm{POI}(\lambda_t), \qquad g_{2t}(y_t) = \mathrm{POI}(\lambda_t + sd_y), \quad \text{with } sd_y = \sqrt{\frac{1}{T-1}\sum_{t=1}^{T}(y_t - \bar{y})^2}. $$

In this case, the null hypothesis does not hold, since the expectation of (1) is not zero. Under the alternative, the test given in Section 2.2 is expected to reject. The higher the rejection rate, the higher the power of the test. Preferably, the rejection rate is (close to) 1.

As described in Section 3.3, the critical values of the test are adjusted for products with an improperly estimated size, such that the size becomes proper. For these specific products, the adjusted critical values are used as the rejection region in the power estimation.
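A minimal sketch of one power simulation following Example 3.2, under the same assumptions as the size sketch in Section 3.3:

    # One power simulation following Example 3.2: g_1t equals the true density,
    # g_2t is shifted by the standard deviation of the realisations, so the
    # null does not hold and the test is expected to reject.
    simulate_power_stat <- function(lambda) {
      y <- rpois(length(lambda), lambda)
      d <- dpois(y, lambda + sd(y), log = TRUE) - dpois(y, lambda, log = TRUE)
      klic_test_stat(d)   # large absolute values reject the null
    }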


3.5 Applied density forecast evaluation

During the simulations for estimating the size and power, the full information set is always used as the set of realisations on which the forecasts are evaluated. This gives an indication of the products for which the method works properly. In practice, however, one part of the data (typically the largest part) is used to train the forecasting algorithm, while the second, typically small, part is used to evaluate the forecast. Since this test set of realisations can be very small, a lot of power could be lost compared to the case where the full information set is used for evaluation.

To compensate for this loss in power, the realisations and density forecasts of different products are combined. They can, for example, be categorized by webshop, product category or popularity. In this way, it is possible to statistically test the performance of competing density forecasting methods for specific groups of products, which combined have a large information set. The combining of information is done as follows. First, for every product in the product category, the differences as in (1) are calculated for all $t \in \{1, 2, \ldots, T\}$. Secondly, the mean of these differences is taken over all products, for each $t$ separately. Thirdly, the average and the HAC estimator of the variance are calculated over these means. Lastly, the test statistic is calculated as in (3). These steps are clarified in Example 3.3.

Example 3.3. Assume that $Q$ products are available in the product category and the test information set is of length $T$. For each product and time period, the difference in logarithmic distances is calculated as

$$ d_{qt} = \ln g_{2qt}(y_{qt}) - \ln g_{1qt}(y_{qt}). $$

Using this result, the mean difference is calculated over all products for each time period separately, that is,

$$ \bar{d}_t = \frac{1}{Q}\sum_{q=1}^{Q} d_{qt}. $$


Now, the mean over all time periods is calculated as

$$ \bar{d} = \frac{1}{T}\sum_{t=1}^{T} \bar{d}_t. $$

The test statistic is

$$ \frac{\bar{d}}{\sqrt{S_d / T}} \xrightarrow{d} N(0, 1), \tag{4} $$

with $S_d$ the HAC estimator of the long run variance of $\bar{d}_t$.
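A compact sketch of the pooled statistic of Example 3.3, assuming two hypothetical Q × T matrices of log forecast densities evaluated at the realisations:

    # Pooled test of Example 3.3; 'logg1' and 'logg2' are assumed to be
    # Q x T matrices of log forecast densities evaluated at the realisations.
    pooled_test_stat <- function(logg1, logg2) {
      dbar_t <- colMeans(logg2 - logg1)   # mean difference per time period
      klic_test_stat(dbar_t)              # HAC-based statistic as in (4)
    }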

Note that the density forecasts generated by ets, nnetar and auto.arima are based on a "fixed estimation sample" forecasting scheme, which involves estimating the models' parameters only once over the in-sample data and using these to produce all out-of-sample forecasts. Although their main focus is on rolling window methods, Giacomini and White (2006) state that their results are also valid for a "fixed estimation sample" forecasting scheme.

Since the power is highly affected by the amount of information used during the forecast evaluation, multiple numbers of products are combined to give a view of the behaviour of the method. To compare the performance of the multiple (more than two) forecasting methods described in Section 2.3, the MCS procedure of Hansen et al. (2011) is used as described in Section 2.2. To apply this procedure, a method must be found to test the null hypothesis of equal performance for more than two forecasting methods. This is done using the following steps:

1. A ranking is made of all forecasting methods based on the average (over time and products) log-likelihood score of their density forecasts on the test set.

2. The forecasting method with the lowest average log-likelihood score is picked out to be tested one-sided on equal performance to the other methods.

3. The distribution of the likelihood scores of all other methods combined (all methods but the worst performing one) is generated using the bootstrap. This is done by randomly selecting, for each block of five days, one of the remaining forecasting methods, which is used to calculate the daily average likelihood over all products. Subsequently, the mean is taken over all these daily averages. This step is repeated 10,000 times, resulting in an approximate distribution of the average combined log-likelihood score.

4. The average log-likelihood score of the worst performing method is tested one-sided for being equal to the score of the other methods combined, using the distribution approximated in step 3. If the null hypothesis of equal performance is rejected, the worst performing forecasting method is left out.

5. Steps 1-4 are repeated until the null hypothesis can no longer be rejected, or until two forecasting methods are left. In the first case, the remaining forecasting methods are concluded to perform equally well and best. In the second case, the standard one-to-one comparison is used as in Example 3.3.

This procedure is clarified in Example 3.4 for the case of comparing three forecasting methods; a code sketch of the bootstrap in step 3 follows the example.

Example 3.4. Assume there are three density forecasts: $g_{1qt}(\cdot)$, $g_{2qt}(\cdot)$ and $g_{3qt}(\cdot)$ for products $q \in \{1, 2, \ldots, Q\}$ and time periods $t \in \{1, 2, \ldots, T\}$. The average log-likelihood score of density forecast $j$ is calculated by taking the mean over time of the average log-likelihood scores over products:

$$ s_{jqt} = \ln g_{jqt}(y_{qt}), \qquad \bar{s}_{jt} = \frac{1}{Q}\sum_{q=1}^{Q} s_{jqt}, \qquad \bar{s}_j = \frac{1}{T}\sum_{t=1}^{T} \bar{s}_{jt}. $$

Now, assume that $\bar{s}_1 > \bar{s}_2 > \bar{s}_3$, such that Forecasting Method 3 is considered the worst performing forecast. The null hypothesis that is tested is

$$ H_0 : \mathrm{E}(s_1) = \mathrm{E}(s_2) = \mathrm{E}(s_3). $$

The bootstrap result of bootstrap iteration $b$ is generated as

$$ s^b_{1,2} = \frac{1}{T}\sum_{t=1}^{T/5-5}\left( a_t \sum_{k=t}^{t+5} \bar{s}_{1k} + (1 - a_t) \sum_{k=t}^{t+5} \bar{s}_{2k} \right), $$


with $a_t \sim \mathrm{Round}(\mathrm{Unif}(0, 1))$. This is repeated $B$ times, resulting in the vector

$$ S^B_{1,2} = \{s^1_{1,2}, s^2_{1,2}, \ldots, s^B_{1,2}\}. $$

The null hypothesis reduces to

$$ H_0 : \mathrm{E}(s^b_{1,2}) = \mathrm{E}(s_3). $$

If $\bar{s}_3$ is smaller than $(1 - \alpha)B$ elements of $S^B_{1,2}$, the null hypothesis is rejected at significance level $\alpha$. After rejection, Forecasting Method 3 is left out. The remaining two forecasting methods are tested on equal performance using the procedure explained in Example 3.3.
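A minimal sketch of the bootstrap in step 3 for the three-method case of Example 3.4; the inputs (daily average score series of the two better-ranked methods and the average score of the worst one) are assumptions.

    # Bootstrap of Example 3.4 (step 3); 'sbar1' and 'sbar2' are assumed to be
    # length-T vectors of daily average log-likelihood scores of the two
    # better-ranked methods, 'sbar3' the average score of the worst one.
    mcs_bootstrap_p <- function(sbar1, sbar2, sbar3, B = 10000) {
      n      <- length(sbar1)
      blocks <- split(seq_len(n), ceiling(seq_len(n) / 5))  # five-day blocks
      boot <- replicate(B, {
        # per block, randomly pick one of the two remaining methods
        mean(sapply(blocks, function(idx) {
          if (runif(1) < 0.5) mean(sbar1[idx]) else mean(sbar2[idx])
        }))
      })
      mean(boot <= sbar3)   # one-sided p-value; reject H0 if below alpha
    }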

Since Optiply is interested in forecasting 90 days ahead, the number of realisations $T$ is set equal to 90. To make sure enough information is available for forecasting and evaluation, only products with more than 360 observations are selected. Because of the high number of observations, no bias problems are expected due to this selection.


Figure 1: An example of the estimated size of the test applied on two different products with $\alpha \in [0, 0.25]$: (a) estimated size as expected; (b) estimated size higher than expected. The black diagonal line corresponds to the case where the estimated size is equal to the nominal size. The light blue area around this diagonal line is the 95% confidence interval of the size estimation.

4 Estimation results

4.1 Size and power estimation

Figure 1 shows an example of the size estimation results of the test against the nominal size, applied on two different products. As can be observed in Figure 1a, the estimated size of the test applied on the first product lies very near the nominal size and is always within the 95% confidence interval of the nominal size. For this product, the estimated size behaves as expected in Section 3.3. On the other hand, Figure 1b shows an example of a product where the estimated size does not behave as expected: the estimate is not near the nominal size and does not lie within the confidence interval. This second product does not yet have a proper size.

Figure 2: The estimated power, plotted against the estimated size for a nominal size $\alpha$ equal to 5%.

To get a good view of the power of the test applied on all 1807 products, Figure 2 shows a scatter plot of the estimated power against the estimated size for a nominal size equal to 5%. As discussed earlier, in the optimal case the estimated size equals 5% and the power 100%. As can be observed in this figure, many products lie near this optimal point. However, at the bottom a cluster appears, consisting of products with very low power. When all products with zero realised demand are coloured red, the low power at the bottom is largely explained. The products with at least one positive demand realisation (black dots) have an average power of approximately 91%, and for almost all of them the power lies above 50%.

As mentioned in Section 3.3, the estimated sizes outside the 95% confidence interval are considered not proper. As shown in Figure 2, part of the products lie outside this interval for $\alpha$ = 5%. Adjusting the critical values such that the size becomes equal to the nominal size results in a collection of estimated sizes that all lie within the confidence interval, as shown in Figure 3.

Figure 3: The estimated power, plotted against the estimated size for a nominal size $\alpha$ equal to 5%, after correcting all improper sizes.

4.2 Comparing predictive ability

As described in Sections 2.3 and 3.5, the ets, nnetar and auto.arima forecasting methods are evaluated. Since the results of Section 4.1 imply that the test is not very powerful for products without a positive demand realisation in the sample, these are left out of the test set. Products with no positive demand in the training sample are ignored as well. These constraints, combined with those in the last part of Section 3.5, result in a product set of 493 items.

Table 1a shows the average log-likelihood of all three forecasting methods for a product group containing only one product. As can be observed from this table, nnetar is indicated as the worst performing forecasting method, and it is therefore tested against the other two methods using the bootstrap method described in Section 3.5. The bootstrap estimate of the distribution of the average likelihood score of ets and auto.arima combined is shown in Figure 4a. As can be observed in this figure, the average log-likelihood score of nnetar is significantly lower than that of ets and auto.arima combined.


              ets      nnetar   auto.arima
    score    -0.457   -0.490   -0.474
    ranking   1        3        2

    (a) One product in product group

              ets      nnetar   auto.arima
    score    -0.234   -0.270   -0.232
    ranking   2        3        1

    (b) All 493 products in product group

Table 1: Comparing the predictive ability of the ets, nnetar and auto.arima forecasting methods on 90 demand realisations for different product groups, containing only products satisfying the constraint of including at least one positive demand realisation. The first row shows the average log-likelihood score of each forecasting method and the second row its ranking in forecasting performance based on this score. Other tables of specific product groups are available on request.

The null hypothesis of equal performance is rejected and the worst performing method, nnetar, is left out. The last step is testing ets (density forecast 1) and auto.arima (density forecast 2) on equal performance using the procedure of Example 3.3. The test statistic as in (4) equals −1.951 in this case, which corresponds to a p-value of 0.0254, so the null hypothesis can be rejected. Therefore, in this product group, ets is significantly the best performing forecasting method at the 5% significance level.

Table 1b shows the average log-likelihood of all three forecasting methods for the product group containing all 493 products. As can be observed from this table, nnetar is again indicated as the worst performing forecasting method, and it is therefore tested against the other two methods using the bootstrap method. The bootstrap estimate of the distribution of the average likelihood score of ets and auto.arima combined is shown in Figure 4b. As can be observed in this figure, the average log-likelihood score of nnetar is significantly lower than that of ets and auto.arima combined. The null hypothesis of equal performance is rejected and the worst performing method, nnetar, is left out. The last step is testing auto.arima (density forecast 1) and ets (density forecast 2) on equal performance using the procedure of Example 3.3. The test statistic as in (4) equals −2.410 in this case, which corresponds to a p-value of 0.00796, so the null hypothesis can be rejected. Therefore, in this product group, auto.arima is significantly the best performing forecasting method at the 5% significance level.


Figure 4: The distribution of the average log-likelihood score of ets and auto.arima combined, estimated by bootstrapping, for two different product groups: (a) one product in the product group; (b) all 493 products in the product group. The red line indicates the average log-likelihood score of the worst performing forecasting method, nnetar, which in both cases lies to the left of, and outside, the distribution.


5 Summary and Discussion

Optiply is interested in the forecasting performance of the ets, nnetar and auto.arima forecasting methods, which are used to forecast the density of demand data of webshops. The KLIC-based evaluation method of Mitchell and Hall (2005) offers a way of comparing the forecasting performance of two competing density forecasts and can be used without knowing the true density of the process. Mitchell and Hall (2005) and Giacomini and White (2006) state that some heteroscedasticity in the data is permitted, but are not very clear about which type of heteroscedasticity and how much. Since in this research the forecasting methods are applied to demand data, which are assumed to be heteroscedastic and autocorrelated due to, for example, seasonality and time trends, Monte Carlo size and power simulations are required to assess the properness of the test. In other words, it must be checked whether this evaluation method has good size and power properties.

For the size simulations, the two competing density forecasts are constructed in such a way that they have the same expected logarithmic distance (KLIC distance) to the true density, so that the null hypothesis holds and the test is not expected to reject. For the power simulations, the two density forecasts are constructed such that one of them is equal to the true density and the other has some plausible distance, so that the null hypothesis does not hold and the test is expected to reject. The size estimation over all products showed that almost all estimated sizes were not significantly different from the nominal size $\alpha$ = 5%, and when they were, the rejection region could be adjusted in such a way that the estimated size became equal to the nominal size, under the distribution assumptions made in the simulations. The power estimation showed that the test is powerful for products with at least one positive realisation, since in that case the power lies above 50% in almost all cases and the average power equals 91%. This leads to the conclusion that the test is proper and powerful for heteroscedastic and autocorrelated demand data, under the condition that there is at least one positive demand realisation.


Since the test has been shown to be proper and powerful for demand data with at least one positive sale, it is applied to the forecasting methods used in practice. Following the procedure of Hansen et al. (2011), it can be concluded that, overall, the auto.arima method performs significantly better than nnetar and ets when applied to the total dataset. Choosing other product groups can lead to different conclusions.

This research focuses on evaluating the full distribution of the demand data. However, in some cases the performance of a density forecast in specific regions is of interest. For example, when a webshop values customer service highly and never wants the stock to be zero while there is positive demand, the density close to zero is of high importance. In the evaluation, errors in this region of interest should then receive a higher weight than normal. Diks et al. (2011) provide KLIC-based scoring rules to compare density forecasts in the tails. Before applying such a test, this method too needs to be checked for properness when applied to heteroscedastic and autocorrelated processes.

Further research could also focus on which product groups form a good combination for evaluating a forecasting technique. Only products that are expected to have the same optimal forecasting technique should be grouped together. In this thesis, only a limited number of group combinations are used; however, it is shown that for some products the best performing forecasting method is not selected when the evaluation group is equal to the total product set.


References

Bain, L. J. and Engelhardt, M. (1992). Introduction to Probability and Mathematical Statistics. Cengage Learning, pp. 103–107.

Bao, Y., Lee, T. and Saltoglu, B. (2004). A test for density forecast comparison with applications to risk management. Working Paper 04-08, UC Riverside.

Berkowitz, J. (2001). Testing density forecasts, with applications to risk management. Journal of Business & Economic Statistics, 19, 465–474.

Diebold, F. X., Gunther, T. A. and Tay, A. S. (1998). Evaluating density forecasts with applications to financial risk management. International Economic Review, 39, 863–883.

Diks, C., Panchenko, V. and van Dijk, D. (2011). Likelihood-based scoring rules for comparing density forecasts in tails. Journal of Econometrics, 163, 215–230.

Giacomini, R. and White, H. (2006). Tests of conditional predictive ability. Econometrica, 74, 1545–1578.

Hansen, P. R., Lunde, A. and Nason, J. M. (2011). The model confidence set. Econometrica, 79, 453–497.

Hyndman, R. J. and Khandakar, Y. (2008). Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 27, 1–22.

Hyndman, R. J. and Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22, 679–688.

Mitchell, J. and Hall, S. G. (2005). Evaluating, comparing and combining density forecasts using the KLIC with an application to the Bank of England and NIESR 'fan' charts of inflation. Oxford Bulletin of Economics and Statistics, 67, 995–1033.


Stock, J. H. and Watson, M. W. (2004). Combination forecasts of output growth in a seven-country data set. Journal of Forecasting, 23, 405–430.
