
Forecasting with generalized fixed-effects regression models using L1-regularization

D.M. Posthuma

S2562944


Master’s Thesis Econometrics, Operational Research and Actuarial Studies

Supervisor: prof. dr. R.H. Koning


Forecasting with generalized fixed-effects regression models using L1-regularization

D.M. Posthuma

Abstract

This paper evaluates penalized generalized fixed-effects regression models for forecasting demand in the context of volatile demand patterns and a high frequency of zero values. Three different single-distribution models are considered: the Gaussian, the Poisson and the negative binomial form. Regression models with time-invariant product-specific covariates are used to allow for individual variation across products. The fixed-effects approach is applied since the product-specific fixed effects are correlated with other covariates included in the model. To prevent overfitting and reduce the model complexity caused by a large number of product-specific covariates, an L1-penalized maximum likelihood analysis is performed. The penalized generalized fixed-effects regression models are compared to one another and to a naive method: the product-specific historical means of the quantity of daily sales. Application of the models to a data set on the quantity of daily sales for 86 laundry detergents demonstrates the prediction performances. The out-of-sample predictions are compared using the AE (absolute error), MSE (mean squared error) and sMSE (scaled mean squared error) as accuracy measures. We find that the penalized fixed-effects regression models based upon count distributions are preferred over the penalized fixed-effects Gaussian model.


Contents

1 Introduction

2 Background literature
  2.1 General notation
  2.2 Traditional quantitative methods
  2.3 Regression methods
  2.4 Model evaluation

3 Data
  3.1 Overview of raw data
  3.2 Panel data structure
  3.3 Final data set
  3.4 Statistics of data set
  3.5 Min-max normalization

4 Research Method
  4.1 Generalized linear models
    4.1.1 Key assumptions for GLMs
  4.2 Panel data
    4.2.1 Fixed-effects Poisson model
    4.2.2 Fixed-effects negative binomial model
  4.3 Final model specifications
  4.4 Regularization techniques
    4.4.1 Tuning parameter optimization
  4.5 Model evaluation
    4.5.1 Naive method

5 Results
  5.1 Lasso regression results
  5.2 Prediction performance results
  5.3 Results per bin
  5.4 Performance of top 6
  5.5 Interpretation of the penalized fixed-effects negative binomial model

6 Conclusion
  6.1 Practical implication

7 Discussion

References


1

Introduction

A critical issue encountered in inventory control is demand forecasting. The importance of accurate demand forecasting to efficient inventory management has long been recognized in several studies. Gurnani and Tang (1999) point out that accurate sales forecasting plays a crucial role in a profitable retail business. Consistently accurate sales forecasts enable a company to ensure the right amount of stock without over- or understocking. This positively affects sales and minimizes the costs of stock. Furthermore, offering the right products at the right time results in high customer satisfaction, as mentioned in Heikkilä (2002). In the current commercial landscape, in which retailers operate in a competitive sector, an effective inventory management system is becoming increasingly important, as described by Fisher et al. (1994). Customers who cannot purchase the product of interest in a chosen shop or receive a delayed delivery are likely to be unsatisfied, which increases the risk of customers switching to competing suppliers.

Major retailers, such as large e-commerce businesses, typically have a diverse assortment consisting of an extensive number of products. The demand patterns for different products vary considerably. Although for some products there is consistently high-volume demand, for a large proportion of products there is tremendous variation in demand. The volatile demand patterns increase the difficulty in determining the precise amount of inventory. An important factor that makes demand volatile is the sequence of zero values in a demand series. Demand with occasional peaks interspersed by time intervals during which no demand occurs is referred to as intermittent demand. Because of its irregularity and unpredictable zero values, intermittent demand is frequently related to inaccurate demand forecasting. Snyder et al. (2012) mention that small amounts of random demand with many zero values incur costly forecasting errors in the form of unmet demand or obsolescent stock. As a result, companies either risk losing sales and customers when items are out of stock or are burdened with excessive inventory costs.


Another important aspect is product-specific variation in demand. It is self-evident that every product is different and has its own product-specific characteristics. Product-specific effects can capture any unobserved effects that differ across products but are fixed over time. Therefore, product-specific parameters should be included in the forecasting models to allow for individual variation across products. However, in the case of a diverse assortment, the number of predictor variables in a regression model can be large due to the estimation of a high number of product-specific effects. As a result, the following two potential complications arise. First, it is hard to say which of the independent variables are relevant and which ones are irrelevant. Secondly, the models are very prone to overfitting. Therefore, it is interesting to investigate techniques that reduce model complexity and prevent overfitting.

Keeping the above-described reasons in mind, we would like to examine various forecasting approaches to predict the volume of demand in the context of volatile demand patterns and a high frequency of zero values. Considering the aforementioned constraints, the following research question is formulated:

Which model provides the best prediction for demand in the context of volatile demand patterns and a high frequency of zero values?

This main research question is divided into separate parts which can be turned into the following sub-questions:

• Which models can

– successfully handle products for which demand is intermittent?
– deal with time-dependent covariates?

• How can we reduce model complexity and prevent overfitting in case of a large number of covariates?


2

Background literature

Demand forecasting has been the subject of research in multiple fields. Numerous studies have focused on time series forecasting. This is an essential area of forecasting in which historical observations of the dependent variable are obtained and analyzed to develop a model describing the underlying process. Mentzer and Cox Jr. (1984) present the results of a survey of 175 midwestern business people to determine the degree of usage and the accuracy obtained with different forecasting techniques. They found that Moving Average (MA), exponential smoothing (ES) and regression were well-known and widely used approaches. This research was performed in 1984, and advances in technology have fundamentally changed the way in which surveys and data are collected nowadays. According to Kolassa (2016), the massive increase in computing power and new database structures allow data to be stored and processed at finer granularities. Due to large-volume data gathering and the storage of lower and lower counts, count data and intermittent demand forecasting form active research fields and will become increasingly important in the future. The data set used in this paper can be characterized by the following three main features. For a large proportion of products, the demand is intermittent. Furthermore, we observe high peaks in the volume of sales which coincide with promotional periods. Lastly, there are gaps in the demand series when a product is out of stock. These three characteristics should be taken into consideration in the investigation of appropriate forecasting methods. Note that the data set is described in more detail in Section 3.

In this chapter, a preliminary literature review for demand forecasting is given. First, we define the notation which is generally used in this paper. Thereafter, we focus on studies about both traditional quantitative forecasting methods and forecasting techniques used for count data. Next, we proceed to highlight different methods that can be used to compare model performances. Lastly, we underline the contribution of this research to the existing literature.

2.1

General notation

There has been great diversity in notation style throughout the literature. Therefore, Table 1 provides the notation which is used consistently in the remainder of this paper.

Notation Description

t time

i individual product

n length of training and validation data

N length of test data

q number of products

Sit observed quantity of sales of product i at time t

Ŝit predicted quantity of sales of product i at time t

Pit price in euros for product i at time t

Table 1: General notation

2.2

Traditional quantitative methods


Autoregressive integrated moving average (ARIMA) models have been used in different domains of the literature. An example is the study of Taylor (2003), who forecasts short-term electricity demand using the ARIMA approach. However, the analyzed time series are often affected by many elements such as political changes, weather temperature and exchange rates. The traditional forecasting methods usually do not allow for explanatory variables. An exception is the ARIMAX model, a widely used extension of the ARIMA model, which includes covariates as additional time series in the model. For example, Fan et al. (2009) analyze the tertiary industry in China. They use both ARIMA and ARIMAX approaches, the latter incorporating the realty business time series as an input series, to analyze and forecast the tertiary-industry production value time series. As expected, the ARIMAX model produces better predictions than the ARIMA model. However, the main drawback of both models is that the time series, both the analyzed and the input time series, should be stationary.

Another representative traditional quantitative method is exponential smoothing (ES). Demand forecasting for high-volume count data, e.g. univariate time series of high sales volume products or sufficiently highly aggregated data, can be handled successfully using ES methods, as shown in Babai et al. (2013). However, these methods have proven to be inappropriate when demand is intermittent, as discussed in numerous studies. As an example, Syntetos and Boylan (2001) underline the fact that exponential smoothing methods place more weight on the most recent data. In the case of intermittent demand forecasting, this results in a series of estimates that are high just after a demand occurrence and lower just before demand occurs again. Furthermore, the optimization of exponential smoothing is based on the squared or absolute error, as shown in Gardner (1985). Consequently, Wallström and Segerstedt (2010) conclude that for data with a large proportion of intermittent demand, an optimization function based on squared or absolute errors will focus on the periods with zero values. Therefore, exponential smoothing generates widely biased estimates when there is a mass of observations around zero. Modifications to exponential smoothing methods to account for this are introduced by Tydeman (1972) and Wright (1986). However, the risk of bias still persists after the modifications.

Snyder et al. (2012) examine various approaches to demand forecasting for slow-moving products for which demand is intermittent and of low volume. They focus on the need for inventory planning when the underlying demand processes of the products may be non-stationary. This emphasis leads to the consideration of processes with time-dependent parameters, and they examine different prediction distributions. One of their findings is that inventory planning should be based upon forecasting models using a distribution that describes count data, rather than the normal distribution. This is in line with other studies, such as those of Cameron and Trivedi (2013) and Davis et al. (2016). Both studies mention that integer-valued time series like count data are characterized by some stylized facts such as low values, over-frequency of zeros, over-dispersion and symmetric marginal distributions. They conclude that continuous-valued time series models are inappropriate for modeling integer-valued time series.


In summary, the normality assumption is often considered for modeling demand since it is attractive from a theoretical perspective. The normal distribution is known to provide a good empirical fit to observed high-volume demand data, for example in the context of fast-moving products where demand occurs regularly and is not particularly variable. However, in the case of volatile demand and a high frequency of zero values, the normality assumption is judged to be far from appropriate. Therefore, forecasting methods which require that the error term follows a normal distribution seem to be inappropriate in the presence of intermittent demand. Besides the normality assumption, the theoretical issues concerning the assumption of stationary time series make ARIMA(X) unsuitable for this research. Furthermore, the lack of time-dependent covariates in ES and Croston's method means that these methods are inappropriate for this research. All in all, traditional quantitative methods cannot be used to properly explain the volatile demand patterns in the context of high-frequency zero values, or the variation that is presumably caused by time-dependent covariates.

2.3

Regression methods

Instead of using traditional quantitative methods, demand forecasting can also be based on regression analysis. Rather than focusing on a historical pattern, the relationship between dependent and independent variables is used to predict future values. Note that regression methods can deal with time-dependent covariates such as price, promotional activities and competitors' prices. Poisson regression models provide the standard framework for the analysis of count data. This type of regression model has several advantages over an ordinary linear regression model, including a skewed, discrete distribution and the restriction of predicted values to non-negative numbers.


Kolassa (2016) considers two data sets of daily retail sales. He only takes into account products for which more than 300 historical data points are available. Like Snyder et al. (2012), Kolassa (2016) fits the various regression models to each single time series separately. Therefore, each product × model combination has been modeled to forecast sales at the product level. His conclusions are consistent with the previously mentioned literature. He finds that sales time series with low counts should be modeled with count distributions rather than a continuous probability distribution.

In summary, the existing literature provides support for the contention that count distributions are preferred over the normal distribution for fitting low-volume demand. Furthermore, it is interesting to investigate the maximum likelihood approach in a panel data setting applied to multiple time series instead of each time series separately. Inspired by previous studies, the first sub-question in this research will be answered as follows:

We would like to examine regression methods rather than the traditional quantitative forecasting approaches. The focus will be on the Gaussian, Poisson (a commonly used benchmark for demand data) and negative binomial (a widely used extension of the Poisson) distributions. Based on the literature, regression models are supposed to be better at adapting to fluctuations and changes in demand series. Therefore, single distribution fitting models will be used in this research. They depend on the precondition that demand follows a certain kind of distribution. We expect that for intermittent and low-volume demand the count regression models are more capable of predicting and yield better results than the Gaussian regression model. This hypothesis is based on the research above and the fact that the restriction of predicted demand to non-negative numbers is particularly relevant for low counts. Furthermore, it is quite likely that the variance of the quantity of daily sales is larger than the mean. We therefore think that the negative binomial distribution is preferred over the Poisson distribution. Lastly, we expect that in the case of high-volume demand there is not a remarkable difference in prediction performance between models assuming a count distribution and those assuming a Gaussian distribution. Consequently, the question arises how the prediction performance of the examined models can be measured and compared.

2.4

Model evaluation

The aim of this research is to determine the most accurate model for forecasting demand in the context of volatile demand patterns and a high frequency of zero values. Therefore, the comparison of prediction performance among various model specifications forms the core of this research.


Keeping this in mind, the third sub-question is answered in the following way:


3

Data

This research uses data on laundry detergents of bol.com for the period from 01-01-2017 until 10-09-2019. The detergent assortment contains a large proportion of products for which demand is intermittent. The variability of demand patterns and the variability of demand size pose a challenge in demand forecasting.

Demand and sales naturally go hand in hand. However, there is a difference between the two. Sales are the process by which people pay money to acquire something they demand. Companies should manage their production capacity, technology and infrastructure to be able to deliver the demand of customers. Companies want to narrow the gap between demand and sales by optimizing their inventory capacity and delivery times. Nonetheless, there are still periods in which companies cannot offer what customers desire. Consequently, sales volume is sometimes lower than demand. To be able to predict demand rather than solely sales volume, we use offer data in addition to sales data. The sales data provide the quantity of daily sales of each product per day. The offer data contain information about the stock level of each product per day. This means that there is no offer when a product is out of stock. Figure 1 provides an overview of how we apply the combination of offer and sales data in this research. The blue blocks represent the three possible measurements of our dependent variable, the quantity of daily sales, in our data set. If there is an offer at time t, then there may be zero or a number of products sold. On the other hand, if there is no offer at time t, the item is out of stock and as a consequence no transaction is possible, represented by NA in the data set. In this case, demand cannot be directly observed. Consequently, we do not measure the total demand for detergents in this study. We measure the quantity of daily sales given that offers are positive. Therefore, we conduct a reduced-form analysis since there is no structural identification for our dependent variable. As a result, we can only interpret the results for the products which are included in this research.

(Flowchart: Is there an offer at time t? If not, the observation is NA. If so: Is there a sale at time t? If yes, the number of sales is recorded; if no, a zero is recorded.)

Figure 1: Set up of the dependent variable: the quantity of daily sales


Figure 2: Normalized daily sales volume curve for four individual products. Note: the red bars indicate the bulk period

The rest of this chapter is structured as follows. First, we give an overview of the different data sets collected by bol.com, and the relevant variables are introduced and explained. Secondly, we describe how we convert the different data sets into a daily panel structure that can be used for modeling. Thereafter, we explain the product selection and elaborate on outliers and inconsistencies in the data set. Next, statistics of the final data set are given. Lastly, we describe the technique used for normalization.

3.1

Overview of raw data

As previously mentioned, we use offer data with information about stock levels additional to sales data. Furthermore, we use data sets that contain information about promotions per product, prices of competitors and campaigns within bol.com. An explanation of the data sets is given below.

Offers


Sales

This is a very basic data set with information regarding the quantity of daily sales of bol.com. To be clear, the price for which the items are sold can differ per order because items can be ordered from different sellers. Therefore, the data set contains information about the minimum, maximum and mode price per day. The mode price is the price for which the most items are sold on a given day. Furthermore, the data record whether the items are ordered in the Netherlands or in Belgium.

Promotions

This data set contains information about active promotions for products on a given day. Bol.com uses various promotional techniques; examples are price off, cheapest product free and day/week deals. Price off is the most frequently used promotion technique in the detergent assortment, capturing 96% of the promotions. This action provides a price reduction of a certain percentage for products. Another attractive promotion of bol.com is the day/week deal, which is a push promotional strategy on the homepage. Customers become more aware of the day/week deal because the promotional products are displayed on the homepage. The promotional strategy differs per day/week deal, but most of the time it is a price reduction as well.

Campaigns

This is a data set with information about days with large campaigns within bol.com. It shows whether there is a bulk period or whether a specific day is labeled as ‘holiday’. Examples of ‘holidays’ are Singles Day, Black Friday, Sinterklaas, Valentine’s Day or Christmas.

3.2

Panel data structure

The data set used in this paper is a combination of the above-mentioned sources. For modeling and estimation, a clear structure containing all relevant variables is preferred. Therefore, a daily panel structure is generated in the following manner.

First, the offer data for detergents is loaded for the period from 01-01-2017 until 10-09-2019. We only take into account the best offer and omit the remaining offers. In this way, we cover an extensive proportion of the sales in the detergent assortment while removing potential outliers in which customers are searching for a specific seller. The research focuses on the time frame from 01-01-2017 until 08-10-2019 for two reasons: the availability of the four data sources determines the starting date, and the test data set used for prediction has to cover the bulk period, which starts at the end of September. We will explain more about the splitting of our data set in the next section. Second, we load the sales data and merge the offer and sales data sets on date and product ID. We obtain a data set containing the best offer and the quantity of daily sales for all detergents of bol.com for a period of approximately 20 months. Next, the promotion data set is loaded and merged on date and product ID. Finally, the campaign data set is loaded and joined on date only, since these campaigns are not product-specific.
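As a rough illustration of this merging step, the following Python sketch builds such a daily panel with pandas. The file names and column labels are hypothetical stand-ins for the bol.com sources described above, not the actual data.

import pandas as pd

# Hypothetical file names and column labels; the actual bol.com sources differ.
offers = pd.read_csv("offers.csv", parse_dates=["Date"])          # Date, ProductID, Price, PriceStar
sales = pd.read_csv("sales.csv", parse_dates=["Date"])            # Date, ProductID, Sales
promotions = pd.read_csv("promotions.csv", parse_dates=["Date"])  # Date, ProductID, Promotion, DayAction
campaigns = pd.read_csv("campaigns.csv", parse_dates=["Date"])    # Date, Bulk

# Merge offers with sales and promotions on date and product ID, and with the
# campaign data on date only, since campaigns are not product-specific.
panel = (offers
         .merge(sales, on=["Date", "ProductID"], how="left")
         .merge(promotions, on=["Date", "ProductID"], how="left")
         .merge(campaigns, on="Date", how="left"))

# An offer without a recorded sale is a zero-sales day; days without an offer do not
# appear in the panel at all (the NA case of Figure 1).
panel["Sales"] = panel["Sales"].fillna(0)
panel[["Promotion", "DayAction", "Bulk"]] = panel[["Promotion", "DayAction", "Bulk"]].fillna(0)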


Variable Class Format Description

Date Date 2018-01-16 Date

ProductID Character 9200000087063245 Product ID

Sales Integer 100 Quantity of daily sales

Price Numeric 14.99 Price

PriceStar Factor 5 Relative price measure

Promotion Dummy 1 Price off dummy

Day Action Dummy 1 Day/week deal dummy

Bulk Dummy 1 Bulk dummy

Table 2: Variables of interest

Level zero implies that bol.com has an extremely high price compared to competitors in the same price-quality tier. Level one indicates that bol.com does not have price information on A-competitors. From level 2 to the highest level 5, the price of bol.com becomes lower and more attractive for customers relative to the prices of A-competitors. Thus, the higher the price star, the better the price provided by bol.com. This variable is useful to account for inter-store promotional effects as described in Walters (1991). Furthermore, the data set contains information on whether there is a price off, day/week deal or bulk period, represented by the dummy variables Promotion, DayAction and Bulk.

3.3

Final data set

We have a large data set that needs to be slightly adjusted before it is suitable for modeling. The detergent assortment of bol.com contains 1407 unique products in the period from 01-01-2017 to 08-10-2019. Since the forecast will be used for inventory planning, the items that are frequently sold are most important. We reduce the number of products to 171 detergents, which cover 85% of the total sales volume during the observation period. This is an enormous reduction in individual products while losing only 15% of the total sales volume. The second criterion is that there has to be a minimum of one offer in the last two weeks. This criterion is chosen because we are only interested in products that are still in the assortment. In this selection, we omit another 68 products, resulting in a sample of 103 products. The third criterion is that products must have a history longer than 14 months. The period of 14 months is chosen to ensure that a product was in the assortment during at least two bulk periods. By applying this criterion, 17 more products are omitted, which results in the final sample of 86 products. As previously mentioned, there is only an observation for product i at time t given that there is an offer at time t. In Appendix A.1, two plots with information about the length of individual demand series are provided. Note that after applying the selection criteria there is still a wide range in the number of observations per product. Furthermore, there are inconsistencies and outliers in the data set which require attention. First, we delete observations where the price star is zero because bol.com does not provide these offers on the website. This also removes exceptional prices, so that only observations with positive prices remain in the data set. Further, two observations with price reductions larger than 80% of the average historical price are omitted. Finally, our data set consists of 60,885 rows representing offers and the quantity of daily sales of 86 unique laundry detergents. Note that the final data set still contains the products shown in Figure 2.
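A hedged pandas sketch of these three selection criteria is shown below, continuing from the hypothetical `panel` frame of the previous sketch; the thresholds mirror the description in the text but the implementation details are assumptions.

import pandas as pd

# Continuing from the hypothetical `panel` frame sketched above.
volume = panel.groupby("ProductID")["Sales"].sum().sort_values(ascending=False)
share = volume.cumsum() / volume.sum()
top = set(share[share <= 0.85].index)                    # products covering ~85% of total sales volume

cutoff = panel["Date"].max()
last_offer = panel.groupby("ProductID")["Date"].max()
first_offer = panel.groupby("ProductID")["Date"].min()
recent = set(last_offer[last_offer >= cutoff - pd.Timedelta(days=14)].index)   # offered in the last two weeks
long_hist = set(first_offer[(cutoff - first_offer).dt.days > 14 * 30].index)   # history longer than ~14 months

panel = panel[panel["ProductID"].isin(top & recent & long_hist)]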


The test data set concerns the period from 11-09-2019 to 08-10-2019 which covers the bulk period that is starting at the end of September. This results in a test data set consisting of 2086 rows which will never be used to train the models. In Chapter 4, we will explain how we apply cross-validation on the training and validation part.


Figure 3: Overview of the split in the data set

3.4

Statistics of data set

The descriptive statistics of the historical data set are given in Table 3. The 58,769 observations are collected during the period from 01-01-2017 until 10-09-2019, which covers 983 days. The descriptive statistics for sales are available upon request for confidentiality reasons.

Variable    N       Mean    St. Dev.  Min    Max
Sales       58,769  (available upon request)
Price       58,769  23.03   12.72     2.750  79.70
PriceStar   58,769  3.475   1.062     1.000  5.000
Promotion   58,769  0.313   0.463     0.000  1.000
Day Action  58,769  0.003   0.051     0.000  1.000
Bulk        58,769  0.080   0.272     0.000  1.000

Table 3: Descriptive statistics of historical data set

As mentioned before, the variable PriceStar accounts for the inter-store promotional effects. Walters (1991) also detects substitution effects within the store as a result of promotions. Figure 4 displays the cross-correlogram of the daily sales volume of the top 5 sold products during the historical period. The ith diagonal plot in this graph is simply the correlogram of the ith product, given by {(h, ρ̂_ii(h)) : h = 0, 1, 2, ...}, where h indicates the lag. The off-diagonal plots contain the estimates of the cross-correlation between the five different products: for i < j we plot {(h, ρ̂_ij(h)) : h = 0, ±1, ±2, ...} and for i > j we plot {(h, ρ̂_ji(h)) : h = 0, ±1, ±2, ...}.
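For reference, a minimal Python sketch of the sample cross-correlation ρ̂_ab(h) between two aligned daily sales series is given below; it is an illustration, not the plotting code used for Figure 4, and assumes the two series have equal length.

import numpy as np

def cross_correlation(a, b, max_lag=20):
    """Sample cross-correlations rho_hat_ab(h) between two aligned daily sales
    series of equal length, for non-negative lags h = 0, ..., max_lag."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    n = len(a)
    # Biased estimator: divide by the full series length n at every lag.
    return np.array([np.sum(a[h:] * b[:n - h]) / n for h in range(max_lag + 1)])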


Figure 4: Cross-correlogram of the daily sales volume of the top 5 sold detergents

The negative correlations show evidence for substitution effects within the detergent assortment of bol.com. This is in line with the findings of Walters (1991), who detects substitution effects within the store as a result of promotions. As mentioned before, price offs and day/week deals are promotional activities of bol.com. We detect that there are merely 125 days without a price off. A histogram representing the distribution of the number of price offs per day is shown in Figure 15 in Appendix A.3. The high percentage of days with a price off means that the variable Promotion is inappropriate for accounting for within-store promotional effects. The number of days with a day/week deal is around 50. Consequently, we will use Day Action to account for within-store promotional effects. The final model specification and a detailed description of the independent variables will be given in Section 4.3.


Figure 5: Kernel density plots of daily sales volume of all and three individual products

3.5

Min-max normalization

To prevent publishing absolute sales volumes of bol.com for privacy reasons, historical normalized patterns are graphed. We transform the daily sales volume to the range 0 to 1 by using min-max normalization, which performs a linear transformation on the original data. Note that only historical sales volumes are transformed using min-max normalization; for the actual model estimation procedure, data at the original scale are used. The min-max transformation is performed in the following manner. Let S be an n × 1 vector containing elements S_t, which are the sales volume for every day t, t = 1, ..., n. Let Y be the n × 1 normalized vector of sales volumes. Each element of the vector Y is obtained by using the following formulation

Y_t = \frac{S_t - \min(S)}{\max(S) - \min(S)},   (1)

where n is the number of days in the observation period. By applying this transformation, the minimum value in S is mapped to 0 and the maximum value in S is mapped to 1. Hence, the entire range of daily sales volume from the minimum to the maximum is mapped to the range 0 to 1. For example, Figure 2 shows the normalized daily sales volumes of four products. In this case, S_t = S_it, the sales volume of product i at time t.
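A minimal Python sketch of equation (1) follows; the sample input values are hypothetical.

import numpy as np

def min_max_normalize(s):
    """Map a vector of daily sales volumes onto [0, 1] as in equation (1)."""
    s = np.asarray(s, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

# Example: the minimum maps to 0 and the maximum maps to 1.
print(min_max_normalize([3, 0, 12, 6]))   # [0.25 0.   1.   0.5 ]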


4

Research Method

The goal of this paper is to accurately predict demand in the context of volatile demand patterns and a high frequency of zero values. Therefore, modeling variation in demand correctly is paramount to accurately predicting volumes of demand. In this section, we start by describing the regression models and investigate whether these approaches are suitable for panel data. Next, the final model specifications are given. Thereafter, a technique to reduce model complexity and prevent overfitting is explained. Lastly, we discuss how the model evaluation is performed.

4.1

Generalized linear models

The count data regression models can be represented and understood using the GLM framework that emerged in the statistical literature in the early 1970s with Nelder and Wedderburn (1972). These models extend the linear modeling framework to variables that are not normally distributed and do not force data into unnatural scales. According to McCullagh and Nelder (1989), the dependent variable is assumed to be generated from a particular distribution in an exponential family, a large class of probability distributions that includes the Gaussian, binomial, Poisson and negative binomial distributions among others. These distributions are useful for fitting the non-normal error structures of most data sets, such as count and demand data. As explained in detail by McCullagh and Nelder (1989), the generalization of the GLM consists of the following four elements:

• A random component; the distribution of the dependent variable y, of which the observation values yi are realizations, is generated from an exponential family

• A linear predictor which includes explanatory variables and relationships among them:

\eta_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi} = \sum_{j=0}^{p} \beta_j x_{ji}   (2)

• A link function, g(·), which specifies the relationship between the linear predictor and the mean, E(y_i) = \mu_i, of the dependent variable:

g(\mu_i) = \eta_i   (3)

\mu_i = g^{-1}(\eta_i)   (4)

• A unit variance function, V(·), that describes the relationship between the variance and the mean and uniquely identifies the distribution:

\operatorname{var}(y_i) = a(\phi) V(\mu_i),   (5)

where \phi is the dispersion parameter and a(\phi) is the dispersion function.

A key assumption of GLM is that the distribution of the dependent variable y is generated from an exponential family. The distribution of the dependent variable belongs to the exponential family when the probability density function (pdf) is of the form

f_y(y_i; \theta_i, \phi) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right\},   (6)

where θi is the canonical parameter that depends on the regressors via the linear predictor


The log-likelihood of the pdf given in (6), expressed as a function of \theta_i and \phi, is given by

l(\theta_i, \phi; y_i) = \log f_y(y_i; \theta_i, \phi)   (7)

Consequently, the log-likelihood is given by

l(\theta_i, \phi; y_i) = \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi)   (8)

According to Nelder and Wedderburn (1972), if y_i has a distribution in the exponential family, then it has mean and variance

E(y_i) = b'(\theta_i) = \mu_i,   (9)

\operatorname{var}(y_i) = b''(\theta_i)\, a(\phi) = V(\mu_i)\, a(\phi)   (10)

The proof can be found in Appendix A.4. The variance is the product of two functions: b''(\theta_i) and a(\phi). The first function, called the variance function, depends on \theta_i. As can be seen in (9), the mean can be expressed in terms of \theta_i. Therefore, the variance function only depends on the mean and can be written as V(\mu_i). The second function, called the dispersion function, is independent of \theta_i and depends only on \phi. Notice that the variance function V(\mu_i) uniquely identifies its parent distribution type within the natural exponential family. Consequently, if a member of the exponential family is specified, the variance structure is also determined.
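As a quick sanity check of (9) and (10), not part of the original derivation, the Poisson specification used later in this section gives

\theta_i = \ln\mu_i, \qquad b(\theta_i) = \exp(\theta_i), \qquad a(\phi) = 1,

so that

E(y_i) = b'(\theta_i) = \exp(\theta_i) = \mu_i, \qquad \operatorname{var}(y_i) = b''(\theta_i)\, a(\phi) = \mu_i = V(\mu_i).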

Furthermore, an advantage of GLM is that estimates are obtained using the maximum likelihood (ML) approach. The log-likelihood based on a set of independent observations y1, ..., yn is just the sum of the individual contributions, so that

l(\mu; y) = \sum_{i=1}^{n} \log f_i(y_i; \theta_i),   (11)

where \mu = (\mu_1, ..., \mu_n). However, there are advantages in using not the log-likelihood function as the goodness-of-fit criterion, but the scaled deviance, which is given by

D^*(\mu; y) = 2\, l(y; y) - 2\, l(\mu; y)   (12)

Note that for the exponential family models (GLMs) considered here, l(y; y) is the maximum likelihood achievable: it corresponds to an exact fit in which the fitted values are equal to the actual values. Furthermore, note that 2l(y; y) does not depend on the parameters. Therefore, maximizing l(\mu; y) is equivalent to minimizing D^*(\mu; y) with respect to \mu. This goodness-of-fit criterion will be used for model selection.
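The GLM framework described above is available in standard software. The following minimal Python sketch (using statsmodels, with a synthetic design matrix X and count response y as hypothetical stand-ins for the thesis data) fits two of the families and reads off the log-likelihood and deviance used as goodness-of-fit criteria.

import numpy as np
import statsmodels.api as sm

# Hypothetical design matrix X (with intercept) and count response y.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = rng.poisson(lam=np.exp(X @ np.array([0.5, 0.3, -0.2])))

# The same GLM interface covers the families considered in this thesis; the negative
# binomial family additionally takes a (known) dispersion parameter alpha.
gaussian_fit = sm.GLM(y, X, family=sm.families.Gaussian()).fit()
poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# The deviance plays the role of the goodness-of-fit criterion D*(mu; y) in (12).
print(poisson_fit.deviance, poisson_fit.llf)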

As mentioned in Section 2, we focus on the Gaussian, Poisson and negative binomial distributions in this research. Therefore, we will show that these distributions belong to the exponential family such that the GLM framework can be used.

Gaussian distribution

We will show that the Gaussian distribution belongs to the exponential family. y_i follows a Gaussian distribution with mean \mu_i and variance \sigma^2, i.e. y_i \sim N(\mu_i, \sigma^2). The range of y_i is (-\infty, +\infty), covering both negative and positive values, and the distribution is symmetric about its mean. The pdf of the Gaussian distribution is given as follows

f(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2} \left( \frac{y_i - \mu_i}{\sigma} \right)^2 \right\}   (13)

and belongs to the exponential family with \theta_i = \mu_i, b(\theta_i) = \frac{1}{2}\theta_i^2 and a(\phi) = \sigma^2. The proof is given in the appendix. Consequently, we have the following four elements of the GLM framework specified as:

• The Gaussian distribution belongs to the exponential family
• The linear predictor:

\eta_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi} = \sum_{j=0}^{p} \beta_j x_{ji}   (14)

• The identity link function:

\eta_i = g(E(y_i)) = E(y_i),   (15)

since we model the mean directly.
• The variance function:

\operatorname{var}(y_i) = a(\phi) V(\mu_i) = \sigma^2,   (16)

which holds if V(\mu_i) = 1.

Poisson distribution

We will show that the Poisson distribution belongs to the exponential family. y_i follows a Poisson distribution with parameter \mu_i > 0, i.e. y_i \sim \text{Pois}(\mu_i). This is a discrete distribution and the range of y_i is non-negative. Further, it assumes that the mean equals the variance. The pdf of the Poisson distribution is given by

f(y_i, \mu_i) = \frac{\mu_i^{y_i} \exp(-\mu_i)}{y_i!}   (17)

and belongs to the exponential family with \theta_i = \ln(\mu_i), b(\theta_i) = \exp(\theta_i) and a(\phi) = 1. The proof is given in Appendix A.6. Consequently, we have the following four elements of the GLM framework specified as:

• The Poisson distribution belongs to the exponential family
• The linear predictor:

\eta_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi} = \sum_{j=0}^{p} \beta_j x_{ji}   (18)

• The log link function, g(·) = ln(·), gives:

\eta_i = \ln(\mu_i)   (19)

• The variance function:

\operatorname{var}(y_i) = a(\phi) V(\mu_i) = \mu_i,   (20)

which holds if V(\mu_i) = \mu_i.

As previously mentioned, one key assumption of the Poisson distribution is that the mean, \mu_i, equals the variance. For the quantity of daily sales considered here, the variance is likely to be larger than the mean, which motivates the negative binomial distribution discussed next.

Negative binomial distribution

Quasi-Poisson and negative binomial regression models can be used for over- and under-dispersed count data. The quasi-Poisson model is only characterized by its mean and variance and does not necessarily have a distributional form; inference is difficult because estimates cannot be obtained by the maximum likelihood approach. Therefore, we consider the negative binomial distribution, which is a mixture of the Poisson and Gamma distributions. This distribution relaxes the assumption that the mean is equal to the variance and is therefore useful for over-dispersed counts, i.e. when the variance is larger than the mean. The assumption that \mu_i = \exp(x_i'\beta) is maintained. We use the general notation

to denote the variance of y_i as given in (10). The negative binomial distribution has the following variance function

V(\mu_i) = \mu_i + \alpha \mu_i^2,   (21)

where \alpha > 0. Note that this reduces to the Poisson distribution if \alpha = 0. The \alpha parameter can be used to adjust the variance independently of the mean. The pdf of the negative binomial distribution is given by

f(y_i \mid \mu_i, \alpha) = \frac{\Gamma(y_i + \alpha^{-1})}{\Gamma(y_i + 1)\,\Gamma(\alpha^{-1})} \left( \frac{\alpha^{-1}}{\alpha^{-1} + \mu_i} \right)^{\alpha^{-1}} \left( \frac{\mu_i}{\alpha^{-1} + \mu_i} \right)^{y_i}, \quad \alpha \ge 0   (22)

and belongs to the exponential family with \theta_i = \ln\left( \frac{\alpha \mu_i}{1 + \alpha \mu_i} \right), b(\theta_i) = -\frac{1}{\alpha} \ln\big(1 - \exp(\theta_i)\big), a(\phi) = 1 and \alpha known. The proof can be found in Appendix A.7. Consequently, we have the following four elements of the GLM framework specified as:

• The negative binomial distribution belongs to the exponential family
• The linear predictor:

\eta_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi} = \sum_{j=0}^{p} \beta_j x_{ji}   (23)

• The log link function, g(·) = ln(·), gives:

\eta_i = \ln(\mu_i)   (24)

• The variance function:

\operatorname{var}(y_i) = a(\phi) V(\mu_i) = \mu_i + \alpha \mu_i^2,   (25)

which holds if V(\mu_i) = \mu_i + \alpha \mu_i^2 since a(\phi) = 1.
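Not part of the original text, but a simple empirical check of the variance function above is to compare the sample mean and variance of the counts; a crude moment estimate of α follows from V(µ) = µ + αµ². The sketch below uses a small synthetic panel as a stand-in for the sales data.

import numpy as np
import pandas as pd

# Hypothetical frame with one row per product-day, standing in for the panel of Section 3.
panel = pd.DataFrame({
    "ProductID": np.repeat(["A", "B"], 100),
    "Sales": np.random.default_rng(3).poisson(lam=2.0, size=200),
})

stats = panel.groupby("ProductID")["Sales"].agg(["mean", "var"])

# Under V(mu) = mu + alpha * mu^2, a crude moment estimate of alpha is
# (var - mean) / mean^2; values above zero indicate over-dispersion relative
# to the Poisson assumption var = mean.
stats["alpha_hat"] = (stats["var"] - stats["mean"]) / stats["mean"] ** 2
print(stats)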

4.1.1 Key assumptions for GLMs


The previous section showed that the Gaussian, Poisson and negative binomial distributions meet the key assumption for the GLM framework. There is a general agreement that the assumption of homogeneous and independent errors should be satisfied, as stated in Cameron and Trivedi (2013) and Hoffmann (2004). However, McCullagh and Nelder (1989) suggest that the independence assumption can be relaxed to “at least uncorrelated”. Furthermore, the normality assumption of errors in GLM is disputed. Dobson and Barnett (2008) and Hoffmann (2004) suppose that this assumption should be met to correctly interpret the results. In contrast, Gill and Torres (2019) note that normally distributed errors are simply a description of model behavior and not a key assumption for the GLM framework. All in all, the notion of p-values is often explained on symmetric Gaussian distributions, which is certainly not the case in this study. Therefore, we compute the standard errors of the point estimates using the non-parametric bootstrap method as discussed in Efron and Tibshirani (1986).

4.2

Panel data

Panel data analysis distinguishes between fixed-effects and random-effects models. Fixed-effects models partial out the effects of time-invariant variables and are suitable if the time-varying variables are the main interest of the research. Fixed-effects models explicitly deal with the fact that product-specific effects may be correlated with the time-dependent covariates. As previously mentioned, it is self-evident that every product is different and has its own characteristics. Therefore, it seems that fixed-effects models are preferred since they allow for individual variation. Note that the Hausman test is applied to test whether the assumption of using the fixed-effects model is correct. Since the obtained p-value, 1.662 × 10^{-10}, is less than 0.05, the use of fixed-effects regression is recommended. The data are assumed to be independent across individuals at a given time but are permitted to be correlated over time for a given individual. According to Cameron and Trivedi (2013), we can use linear regression models with the addition of an individual-specific term reflecting individual heterogeneity. This results in a fixed-effects model with separate parameters for each individual, which capture the individual correlations over time. The fixed-effects Gaussian regression model for panel data includes the individual-specific terms without complications since the mean is modeled directly. The final specification will be given later in this research.

4.2.1 Fixed-effects Poisson model

The fixed-effects Poisson regression model for panel data has been described in detail by Cameron and Trivedi (2013). The dependent variable S_{it} varies over products (i = 1, ..., q) and over time (t = 1, ..., n). It is assumed that S_{it} follows a Poisson distribution with parameter \mu_{it} as given by

S_{it} \sim \text{Pois}[\mu_{it} = \gamma_i \lambda_{it}]   (26)

\lambda_{it} = \exp(x_{it}'\beta),   (27)

where \gamma_i is the individual effect of product i. Given the exponential form for \lambda_{it}, the individual effects can still be interpreted as a shift in the intercept because of the following

E(S_{it}) = \mu_{it}   (28)
          = \gamma_i \exp(x_{it}'\beta)   (29)
          = \exp(\delta_i + x_{it}'\beta),   (30)

where \delta_i = \ln \gamma_i. Consequently, we have

\ln \mu_{it} = \delta_i + x_{it}'\beta   (31)


Thus, the individual effects enter only through the intercept and not through the slope coefficients of x_{it}. Hsiao (2014) finds that for logistic regression the estimation of fixed-effects models including individual dummy variables yields inconsistent estimates of \beta. Therefore, it is interesting to investigate whether this is a problem for the Poisson fixed-effects model. According to Cameron and Trivedi (2013), the estimates for \beta and the associated covariance matrix obtained by the conditional and unconditional maximum likelihood approaches are always identical. Further, Windmeijer and Santos Silva (1997) agree that for the fixed-effects Poisson model there is no incidental parameters problem. Thus, we can apply fixed-effects Poisson regression models by unconditional maximum likelihood with dummies which reflect the individual heterogeneity. The final specification will be given later in this research.
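A minimal Python sketch of this dummy-variable (unconditional maximum likelihood) approach is given below; it assumes a long-format frame `panel` with the columns of Table 2, and the covariate list is an abbreviated stand-in for the full specification of Section 4.3.

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Including C(ProductID) adds one dummy per product, so the pooled Poisson GLM below
# is the unconditional maximum likelihood estimator of the fixed-effects Poisson model.
fe_poisson = smf.glm(
    "Sales ~ C(ProductID) + np.log(Price) + Promotion + DayAction + Bulk",
    data=panel,
    family=sm.families.Poisson(),
).fit()
print(fe_poisson.summary())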

4.2.2 Fixed-effects negative binomial model

Hausman et al. (1984) have proposed a conditional negative binomial model for panel data. However, Allison and Waterman (2002) show that this is not a valid fixed-effects model: the model proposed by Hausman et al. (1984) is based on a regression decomposition of the over-dispersion parameter rather than the usual regression decomposition of the mean as described for the fixed-effects Poisson model. Therefore, Allison and Waterman (2002) investigate three alternative methods, one of which is an unconditional negative binomial regression estimator with dummy variables to represent the individual fixed effects. Based on a simulation study, they conclude that the unconditional negative binomial regression model is preferred to the two alternatives. Furthermore, Allison and Waterman (2002) also find no evidence of any incidental parameter bias in the coefficients. In this conventional negative binomial regression model, the fixed effects are directly estimated rather than conditioned out of the likelihood. The final specification of the model used in our research will be given later.

For panel data, the link and variance functions are slightly adjusted since the individual observations are obtained over time. Table 4 gives an overview of the link and variance functions as used in this paper.

Family             Notation              Link function   Range of S_it    Variance function
Gaussian           N(\mu_{it}, \sigma^2) \mu_{it}        (-\infty, +\infty)  1
Poisson            Pois(\mu_{it})        \ln(\mu_{it})   0, 1, ..., \infty   \mu_{it}
Negative binomial  NBin(\mu_{it}, \alpha) \ln(\mu_{it})  0, 1, ..., \infty   \mu_{it} + \alpha\mu_{it}^2

Table 4: Specifications for the generalized fixed-effects regression models

4.3

Final model specifications

We consider the GLMs with three different assumptions of the error distribution: Gaussian, Poisson and negative binomial. For each of these models, a final model specification is determined. The linear prediction function is the same in each model and is given by

\eta_{it} = \delta_i + \beta_1 \ln P_{it} + \beta_2 Prom_{it} + \beta_3 DayAct_{it} + \beta_4 Bulk_t + \beta_5 Prom_{it} \times Bulk_t + \beta_6 Comp_{it} + \sum_{k=1}^{4} \beta_{6+k} PriceStar_{it},   (32)

where \delta_i is the fixed effect capturing the individual characteristics of product i. These product-specific effects are included as effect-coded dummy variables. Effect coding contrasts product means with the grand mean rather than with a specified product. The variable P_{it} is the price of product i at time t. The variable is log-transformed to capture percentage effects instead of absolute effects. The variables Prom_{it} and DayAct_{it} are dummies indicating whether or not there is a price off or day/week deal at time t for each product i, respectively. For example, Prom_{it} takes value 1 if there is a price off at time t for product i. Further, Bulk_t is a dummy variable which indicates whether time t coincides with a bulk period. Additionally, the interaction effect between Prom_{it} and Bulk_t is added: we expect the effect of price offs on sales to differ during and outside the bulk period, because customers are more aware of promotions during the bulk period. Furthermore, Comp_{it} is a dummy variable that accounts for the within-store promotional effects. It takes value 1 if there is a day/week deal at time t for any other product than product i. PriceStar_{it} are effect-coded dummy variables which indicate the relative price level for product i at time t and therefore control for inter-store promotional effects.

Fixed-effects Gaussian model

The first proposed model is the fixed-effects Gaussian regression model. We assume that the quantity of daily sales, S_{it}, follows a Gaussian distribution such that S_{it} \sim N(\mu_{it}, \sigma^2). Using the link function provided in Table 4, we have the following specification

E(S_{it}) = \eta_{it} = \delta_i + \beta_1 \ln P_{it} + \beta_2 Prom_{it} + \beta_3 DayAct_{it} + \beta_4 Bulk_t + \beta_5 Prom_{it} \times Bulk_t + \beta_6 Comp_{it} + \sum_{k=1}^{4} \beta_{6+k} PriceStar_{it},   (33)

where \eta_{it} is defined in (32), E(S_{it}) is the expected count of S_{it} for product i at time t and the explanatory variables are defined before.

Fixed-effects Poisson model

The second proposed model is the fixed-effects Poisson regression model. For this model we assume that the quantity of daily sales, S_{it}, follows a Poisson distribution such that S_{it} \sim \text{Pois}(\mu_{it}). Using the link function provided in Table 4, we have the following specification

E(S_{it}) = \mu_{it} = \exp(\eta_{it})   (34)
          = \exp\left\{ \delta_i + \beta_1 \ln P_{it} + \beta_2 Prom_{it} + \beta_3 DayAct_{it} + \beta_4 Bulk_t + \beta_5 Prom_{it} \times Bulk_t + \beta_6 Comp_{it} + \sum_{k=1}^{4} \beta_{6+k} PriceStar_{it} \right\},   (35)

where \eta_{it} is defined in (32), E(S_{it}) is the expected count of S_{it} for product i at time t and the explanatory variables are defined before.

Fixed-effects negative binomial model

The third examined model is the fixed-effects negative binomial regression model. For this model we assume that the quantity of daily sales, S_{it}, follows a negative binomial distribution such that S_{it} \sim \text{NBin}(\mu_{it}, \mu_{it} + \alpha\mu_{it}^2). The negative binomial regression model has the same specification for the mean, E(S_{it}) = \mu_{it} = \exp(\eta_{it}), as the fixed-effects Poisson model.

As mentioned above, the final models are estimated by maximizing the unconditional log-likelihood functions of the parameters for the observed data. Let x_{it} be the vector of explanatory variables mentioned above for product i at time t. According to Greene (2004), the log-likelihood for a sample of repeated observations on q products is given by

\log L = \sum_{i=1}^{q} \left[ \sum_{t=1}^{n_i} \log g(S_{it}, x_{it}'\beta + \delta_i, \alpha) \right]   (36)

where n_i is the possibly varying number of observations for each individual i and where \alpha is the over-dispersion parameter, which only appears in the negative binomial form.
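A hedged Python sketch of how the linear predictor in (32) could be set up with effect (Sum) coding is given below. The variable names, including the derived within-store promotion dummy `Comp`, are hypothetical stand-ins for the covariates described above; note that statsmodels treats the negative binomial dispersion α as known, whereas the thesis estimates it.

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Sum (effect) coding for the product fixed effects and the price-star levels,
# a log-transformed price, and the Promotion x Bulk interaction of equation (32).
formula = ("Sales ~ C(ProductID, Sum) + np.log(Price) + Promotion * Bulk "
           "+ DayAction + Comp + C(PriceStar, Sum)")

fe_negbin = smf.glm(formula, data=panel,
                    family=sm.families.NegativeBinomial(alpha=1.0)).fit()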

4.4

Regularization techniques

In the above-specified models, the number of predictor variables can be large due to the estimation of product-specific covariates. To reduce the model complexity and prevent overfitting, regularization techniques can be used as discussed in Owen (2007). Ridge and lasso regression are regularized versions of least squares regression using L2 and L1 penalties, respectively. The difference between the techniques is that the lasso can reduce coefficients to exactly zero, while for ridge regression the shrunken coefficients can approach zero but never actually become equal to zero. The main benefit of the lasso is that it can result in sparse models (i.e. with fewer parameters). Sparsity is desirable for interpretation. Therefore, we answer the second sub-question in this research as follows:

Lasso regression, also known as the L1 regularization technique, is used to reduce model complexity and prevent overfitting.

Lasso regression was introduced by Tibshirani (1996). This regularization process adds a penalty proportional to the absolute value of the magnitude of the regression coefficients and tries to minimize them. This can result in sparse solutions because some coefficients of redundant variables shrink to zero and are eliminated from the model. The goal of this regularization process is to minimize the prediction error. In practice, the tuning parameter λ controls the strength of the penalty: the larger the tuning parameter, the more coefficients are shrunk to zero, which reduces the dimensionality. For a sufficiently large λ, all coefficients are forced to be zero. This regularization process has the advantage of reducing model complexity, and the model is easier to interpret due to the reduction in coefficients. Another advantage is that prediction can be improved, because shrinking and eliminating coefficients can reduce the variance without a considerable increase in the bias. In terms of the tuning parameter λ, the variance decreases and the bias rises as λ increases.

The log-likelihood function as described in (36) should be adjusted with the lasso regression penalty. Let \zeta be the (q + 9) \times 1 vector defined as \zeta = (\delta_1, ..., \delta_{q-1}, \beta_1, \beta_2, ..., \beta_{10}). The penalized log-likelihood function will be as follows

\log L = \sum_{i=1}^{q} \left[ \sum_{t=1}^{n_i} \log g(S_{it}, x_{it}'\beta + \delta_i, \alpha) \right] - \lambda \sum_{k=1}^{q+9} |\zeta_k|,   (37)

where \beta = (\beta_1, ..., \beta_{10}) and where \lambda is the tuning parameter. The regularization procedure requires the selection of an appropriate value for the tuning parameter, which is discussed next.
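A minimal Python sketch of an L1-penalized GLM fit in the spirit of (37) is shown below; it uses statsmodels' elastic-net solver with L1_wt = 1.0 (pure lasso) and a synthetic design matrix and response as hypothetical stand-ins for the product dummies and covariates.

import numpy as np
import statsmodels.api as sm

# Hypothetical design matrix X and count response y.
rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(300, 4)))
y = rng.poisson(lam=np.exp(X @ np.array([0.2, 0.4, 0.0, -0.3, 0.1])))

# L1-penalized maximum likelihood: `lam` plays the role of the tuning parameter in (37).
lam = 0.1
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit_regularized(
    method="elastic_net", alpha=lam, L1_wt=1.0)

# Coefficients shrunk exactly to zero drop out of the model.
n_nonzero = int(np.sum(np.abs(fit.params) > 1e-8))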

4.4.1 Tuning parameter optimization

The selection of the optimal value for the tuning parameter is based on cross-validation. Cross-validation is a resampling procedure used to evaluate the penalized models on a limited data sample. We apply k-fold cross-validation to our historical data set to determine which lasso regression penalty results in the lowest cross-validation error. The approach involves randomly dividing the historical data set into k non-overlapping folds of approximately equal size. The following procedure is iterated over the k-folds:

• The model is fitted using k − 1 of the folds as training data.

• The model is validated on the remaining kth fold, which is used to calculate the cross-validation error.

The cross-validation error reported by k-fold cross-validation is then the average of the values computed in the aforementioned iterations. This enables us to identify the optimal value for λ. Therefore, k-fold cross-validation helps choose the value of the tuning parameter which minimizes the cross-validation error.

In this research, the cross-validation error is measured by the penalized log-likelihood function as expressed in (37), which should be maximized. In the case of the Gaussian linear approaches, the mean squared error is used as the cross-validation error: parameters chosen to maximize the log-likelihood are exactly those chosen to minimize the mean squared error in Gaussian regression forms. Another option is the deviance, as explained in Section 4.1, since maximizing the log-likelihood is equivalent to minimizing the deviance. Goodness-of-fit measures such as the log-likelihood and deviance are useful in determining the model fit. However, these measures cannot be used for predictive performance comparison across various models. In the following section, we explain how we measure and compare predictive performance.
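The sketch below illustrates, under stated assumptions, how such a k-fold search over a λ grid could be carried out in Python, here using the Poisson deviance as the cross-validation error and also computing a one-standard-error choice of the kind reported in Section 5.1. The data are synthetic stand-ins for the thesis's design matrix and response.

import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import KFold

# Hypothetical design matrix X (product dummies plus covariates) and count response y.
rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(500, 5)))
y = rng.poisson(lam=np.exp(X @ rng.normal(scale=0.3, size=6)))

lambdas = np.logspace(-3, 1, 20)
kf = KFold(n_splits=10, shuffle=True, random_state=1)
cv_err = np.zeros((len(lambdas), kf.get_n_splits()))

for j, lam in enumerate(lambdas):
    for k, (tr, va) in enumerate(kf.split(X)):
        fit = sm.GLM(y[tr], X[tr], family=sm.families.Poisson()).fit_regularized(
            method="elastic_net", alpha=lam, L1_wt=1.0)
        mu = np.exp(X[va] @ fit.params)                      # log link
        # Poisson deviance on the held-out fold (terms with y = 0 contribute 2*mu).
        dev = 2 * np.sum(y[va] * np.log(np.where(y[va] > 0, y[va] / mu, 1.0)) - (y[va] - mu))
        cv_err[j, k] = dev

mean_err = cv_err.mean(axis=1)
se_err = cv_err.std(axis=1) / np.sqrt(kf.get_n_splits())
lam_min = lambdas[np.argmin(mean_err)]                        # minimizes the CV error
lam_1se = lambdas[mean_err <= mean_err.min() + se_err[np.argmin(mean_err)]].max()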

4.5

Model evaluation

Since forecasting is our main interest, we focus on the out-of-sample predictive performance of the penalized fixed-effects GLMs. Accuracy measures adopted in this paper are: absolute error (AE), mean squared error (MSE) and scaled mean squared error (sMSE) in accordance with guidelines reported by Petropoulos and Kourentzes (2015). We adopt the scaled error instead of percentage errors because our time series contain periods with zero observed demand. The out-of-sample accuracy measures are computed by the following formulas

\text{AE} = \sum_{i=1}^{q} \sum_{t=1}^{N_i} |S_{it} - \hat{S}_{it}|   (38)

\text{MSE} = \frac{1}{N} \sum_{i=1}^{q} \sum_{t=1}^{N_i} (S_{it} - \hat{S}_{it})^2   (39)

\text{sMSE} = \frac{1}{N} \sum_{i=1}^{q} \sum_{t=1}^{N_i} \left( \frac{S_{it} - \hat{S}_{it}}{\frac{1}{n_i}\sum_{t=1}^{n_i} S_{it}} \right)^2,   (40)

where N_i is the possibly varying number of observations for each individual i in the test data set, n_i is the number of historical observations of product i, S_{it} are the actual values and \hat{S}_{it} are the predicted values of the quantity of daily sales of product i at time t. Further, N is the total number of observations in the test data set.
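The following short Python sketch computes (38)-(40); the frames `test` and `hist` and the column names `Sales` and `Predicted` are hypothetical stand-ins for the thesis's test and training data.

import pandas as pd

def accuracy_measures(test, hist):
    """AE, MSE and sMSE as in (38)-(40). `test` holds actual (`Sales`) and predicted
    (`Predicted`) values per product-day; `hist` holds the historical training sales
    used for the product-specific scaling means."""
    err = test["Sales"] - test["Predicted"]
    ae = err.abs().sum()
    mse = (err ** 2).mean()
    hist_mean = hist.groupby("ProductID")["Sales"].mean()   # (1/n_i) * sum of historical sales
    smse = ((err / test["ProductID"].map(hist_mean)) ** 2).mean()
    return ae, mse, smse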

4.5.1 Naive method

We compare the out-of-sample predictions obtained by the penalized fixed-effects generalized linear models with one another and with the naive method. This baseline method is the product-specific historical mean of the quantity of daily sales. Therefore, the naive method gives for a specific product i at time t the following prediction

\hat{S}_{it} = \frac{1}{n_i} \sum_{t=1}^{n_i} S_{it},   (41)

where n_i is the number of historical observations of product i.
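A two-line pandas sketch of this benchmark, reusing the hypothetical `hist` and `test` frames from the previous sketch, could look as follows.

import pandas as pd

# The naive benchmark of (41): the product-specific historical mean of daily sales.
naive = (hist.groupby("ProductID", as_index=False)["Sales"].mean()
             .rename(columns={"Sales": "Predicted"}))
test_naive = test.merge(naive, on="ProductID", how="left")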

5

Results

To obtain predictions of demand in the presence of highly volatile demand patterns, three types of penalized generalized fixed-effects regression models are examined. We consider the Gaussian, Poisson and negative binomial distributions. In this section, we analyze the model performances using sales and offer data on laundry detergents from bol.com. We will first show how the tuning parameter for regularization is estimated. Next, we compare the final model performances based on out-of-sample accuracy measures. Thereafter, we elaborate on the negative binomial fixed-effects model specification which is found to be the best performing model.

5.1

Lasso regression results

In this section, we discuss the estimated results for the tuning parameter. 10-fold cross-validation is used to determine the optimal value of λ that minimizes the cross-validation curve. This optimal value of the tuning parameter is referred to as λmin. Additionally, λ1se is determined, which gives the most regularized model such that the cross-validated error is within one standard error of the minimum value.

Fixed-effects Gaussian model

Figure 6 represents the lasso regression results for the fixed-effects Gaussian model. The horizontal axis represents the logarithm of the tuning parameter. The horizontal axis at the top indicates the number of non-zero coefficients at the current λ, which is the degrees of freedom in the lasso regression. The left panel presents the shrinkage of the coefficients graphically. The magnitude of the coefficients is measured on the vertical axis, and each curve corresponds to a covariate, showing the path of its coefficient against the logarithm of λ. As expected, the larger the tuning parameter, the more coefficients are reduced to zero. The right panel provides the 10-fold cross-validation curve, represented by the red dotted line, along the λ values. The error bars indicate the upper and lower standard deviation curves. The vertical axis represents the mean squared cross-validation error. The two selected tuning parameter values (λmin and λ1se) are indicated by the left and right vertical dotted lines, respectively. The estimated values of both selected tuning parameters can be found in the first column of Table 8 in Appendix A.8. Note that the detailed specification of the fixed-effects Gaussian model can be found in Section 4.3.

Considering Figure 6, the model selected by the optimal tuning parameter is the original model including 95 variables without the intercept. As a result of the one-standard-error rule, 34 coefficients are shrunk to zero. Therefore, the sparse model selected by λ1se includes 61 variables without the intercept. We present the estimated results of both models in Table 10 in Appendix B.1. Point estimates are provided for both models in the first and second columns, respectively. As a result of increasing the tuning parameter from λmin to λ1se, a few parameters of both time-dependent and product-specific covariates are reduced to zero. Since our primary goal is to predict the demand volume rather than to estimate the specific effects of the time-dependent covariates, we will only interpret the point estimates of the best performing model later in this results section.

Figure 6: Lasso regression results for the fixed-effects Gaussian model.

Fixed-effects Poisson model

The lasso regression results for the fixed-effects Poisson model are shown in Figure 7. Similar to the above findings, the left panel shows that more coefficients are reduced to zero as the tuning parameter increases. In the right panel, the vertical axis represents the Poisson cross-validation deviance. The model selected by λmin is the original model including 95 coefficients without the intercept. In the model selected by λ1se, four coefficients are reduced to zero. Point estimates are provided for both models in the first and second columns of Table 11 in Appendix B.1. Note that the detailed specification of the fixed-effects Poisson model can be found in Section 4.3. Furthermore, the values of the selected tuning parameters are given in the second column of Table 8 in Appendix A.8. As mentioned before, we will only interpret the point estimates of the best performing model later in this results section.


Fixed-effects negative binomial model

Figure 8 presents the lasso regression results for the fixed-effects negative binomial model. As can be seen in the left panel, some coefficients become unstable for larger values of λ. For example, the red line, which represents the coefficient of the DayAct dummy variable, fluctuates considerably. The panel on the right also shows that the standard deviation increases as the tuning parameter increases, since the error bars are spread out over a wider range. Furthermore, we observe that the log-likelihood value decreases, i.e. becomes more negative, for higher values of the tuning parameter. This indicates that models selected by a higher value of λ, for which larger standard deviations occur, do not minimize the cross-validated error. Therefore, the region in which the coefficients fluctuate is not of interest. Selecting the optimal value λmin gives the original model, which includes 95 variables excluding the intercept. In the model selected by λ1se, two coefficients of product-specific parameters are reduced to zero. This results in a sparse model that includes 93 variables without the intercept. Point estimates are provided for both models in the first and second columns of Table 12 in Appendix B.1. We will elaborate on the point estimates later in this results section. Note that the detailed specification of the fixed-effects negative binomial model can be found in Section 4.3. Furthermore, the estimated values of both selected tuning parameters can be found in the third column of Table 8 in Appendix A.8.

Figure 8: Lasso regression results for the fixed-effects negative binomial model
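To indicate how such a penalized fixed-effects count model can be estimated in practice, the sketch below builds product dummies as fixed effects and fits an L1-penalized GLM with statsmodels. The data frame df and the covariate names ('price', 'day_act') are hypothetical, and the thesis' own estimation routine may differ; note that the lasso penalty here is applied to the product-specific dummies as well, in line with the shrinkage of product-specific coefficients reported above.

```python
import pandas as pd
import statsmodels.api as sm

def fit_penalized_fe_glm(df: pd.DataFrame, lam: float, family):
    """L1-penalized GLM with product-specific fixed effects (dummy variables).

    family can be sm.families.Gaussian(), sm.families.Poisson() or
    sm.families.NegativeBinomial().
    """
    fe = pd.get_dummies(df['product'], prefix='prod', drop_first=True).astype(float)
    X = sm.add_constant(pd.concat([df[['price', 'day_act']], fe], axis=1))
    model = sm.GLM(df['sales'], X, family=family)
    # elastic net with L1_wt=1.0 corresponds to the pure lasso penalty
    return model.fit_regularized(method='elastic_net', alpha=lam, L1_wt=1.0)
```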

5.2 Prediction performance results

The test data of 28 days (from 11-09-2019 to 08-10-2019) is used for evaluation. In this section, the prediction results of the naive method and of the penalized generalized fixed-effects regression models are shown. We compare model performance using the absolute error (AE), mean squared error (MSE) and scaled mean squared error (sMSE) as accuracy measures. Results are given for the full product selection and for four product bins. The products are divided into the bins based on the product-specific historical mean of sales volume (µi) relative to the distribution of the observed daily sales volume in the test data set. Note that the sample mean in the test data is 7.0 sales per day. Bin 1 consists of the top 15 sold products, for which µi ≥ 11.1. Bin 2 corresponds to products in the range 7.0 ≤ µi < 11.1. Bin 3 includes products for which 2.2 ≤ µi < 7.0. Products with very low daily sales volume (µi < 2.2) are captured in bin 4.
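As an illustration, assuming the product-specific historical means are available as a pandas Series mu_i, the bin assignment described above can be reproduced with a single pd.cut call; the variable and label names are ours, not the thesis'.

```python
import numpy as np
import pandas as pd

def assign_bins(mu_i: pd.Series) -> pd.Series:
    """Map each product's historical mean daily sales to one of the four bins."""
    return pd.cut(
        mu_i,
        bins=[-np.inf, 2.2, 7.0, 11.1, np.inf],
        labels=['bin 4', 'bin 3', 'bin 2', 'bin 1'],  # lowest to highest volume
        right=False,                                  # intervals of the form [a, b)
    )
```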

We distinguish between levels of popularity since we are particularly interested in the model performances for low- and high-volume demand. Therefore, we examine the predictive performance of the models for each bin separately. The computed accuracy measures obtained by the seven different models are reported in Table 5. The columns Modelmin and Model1se correspond to the fixed-effects regression models selected by λmin and λ1se, respectively. The model with the best predictive performance per row is indicated in green.

                             Models
         Measure   Naive    Gausmin  Gaus1se  Poismin  Pois1se  NBmin    NB1se
Overall  AE        24591    21750    20861    13091    13132    12808    12903
         MSE       792.50   522.27   538.95   310.62   307.55   289.32   290.79
         sMSE      110.04   80.059   77.368   104.64   103.53   102.61   101.96
Bin 1    AE        11251    8691.8   8771.4   4304.2   4322.5   4145.9   4193.0
         MSE       3118.3   2012.4   2140.7   597.69   590.97   486.91   500.93
         sMSE      5.9608   3.5013   3.8460   2.1678   2.1298   1.8032   1.8678
Bin 2    AE        3668.3   1823.4   1673.2   1633.5   1616.3   1578.5   1566.8
         MSE       455.64   129.53   134.17   112.67   110.80   120.14   119.84
         sMSE      6.3478   1.7163   1.7666   1.4935   1.4553   1.5313   1.5255
Bin 3    AE        4388.2   5083.1   4745.3   2388.8   2392.3   2393.5   2442.9
         MSE       62.452   78.999   83.359   20.779   20.701   21.771   22.910
         sMSE      3.6339   6.9388   7.4291   1.1617   1.1736   1.1909   1.2481
Bin 4    AE        5284.4   6151.8   5670.8   4764.20  4801.1   4690.0   4700.4
         MSE       646.42   467.88   445.38   624.01   618.42   609.65   605.49
         sMSE      351.36   251.51   241.79   341.96   337.31   334.43   332.16

Table 5: Model performance comparison for the full product selection and for the four bins

Note that the absolute errors cannot be compared across the bins since the number of products differs per group. In general, we do not observe a substantial difference in performance between the models selected by the optimal and the one-standard-error tuning parameter. This implies that penalizing more heavily than λmin increases the bias but does not significantly decrease the variance, and therefore the predictive performance is not improved. Furthermore, we find that the naive method is always outperformed by at least one penalized generalized fixed-effects regression model.

Next, we discuss the overall performance of the models, shown in the first three rows of Table 5. As can be seen, the NBmin model has the best performance based on the AE and MSE. However, Gaus1se is the best performing model based on the sMSE. Recall that in the scaled version the errors are divided by the product-specific historical mean of the quantity of daily sales. The scaling ensures that products with low- and high-volume sales are valued more equally in measuring model performance. This suggests that the Gaus1se model is more accurate in predicting very low-volume demand than NBmin, since less weight is attached to high-volume demand due to the scaling. The model performance results for bin 4 are in line with this finding, since the Gaus1se model outperforms the other models on two out of three accuracy measures for products in bin 4. Further, we find that NBmin clearly obtains the best predictive performance for the high-volume products captured in bin 1.


As previously mentioned, we observe small differences between the predictive performances of the models penalized by the two selected tuning parameters. In the previous section, the lasso regression results showed that increasing the tuning parameter to λ1se only has a considerable effect in reducing model complexity for the Gaussian model. Therefore, we only consider the results of the fixed-effects Gaussian model selected by λ1se. For the fixed-effects Poisson and negative binomial models, we focus on examining the models selected by λmin.

Figure 9 shows the relation between the actual quantity of daily sales and the predictions obtained by the four different models. Ideally, all points should be close to the red diagonal line. However, we observe that most predictions deviate from the actual values. The dots primarily lie above the red line, indicating that the models tend to under-forecast (predictions are less than the actual quantity of daily sales). As can be seen, the naive and Gaus1se methods poorly predict high-volume demand. The highest prediction obtained by Gaus1se is less than 100, although the highest actual value is around 350 sales. Further, the Gaus1se model obtains proper predictions for low-volume demand, since the dots corresponding to low actual values lie relatively close to the diagonal line. These findings are consistent with the prior findings based on the accuracy measures. Furthermore, we do not observe a considerable difference between the predictive performances of the Poisson and negative binomial models. The dots of both models exhibit similar patterns. Both models obtain a few close-to-zero-valued predictions for positive-valued actuals, and for higher predictions the models behave similarly. This is confirmed by the values of the Pearson correlation coefficient (R): the values (0.82 and 0.83) indicate a fairly strong positive linear relationship between the actuals and the predictions obtained by the Poismin and NBmin models. In summary, Figure 9 shows that the models based on count distributions perform better than both the naive method and the Gaussian model. The exception is products with very low-volume demand, for which the Gaus1se model seems to obtain the highest predictive performance. Further, the estimates of the quantity of daily sales deviate from the actuals. Therefore, we proceed to conduct a residual analysis.


Figure 10 shows the predictions against the residuals for the four different models. Heteroscedasticity often occurs when there is a wide range between the lowest and highest value of the observations. Typically, the pattern of heteroscedasticity is that as the predictions increase, the variance of the residuals also increases. It can be seen that there is an evident pattern of heteroscedasticity. However, there is also high variation in the errors around close-to-zero-valued predictions. Note that the presence of heteroscedasticity leads to misleading and incorrect standard errors, which can affect interval estimation and hypothesis testing. Since our main goal is prediction rather than the interpretation of the specific effects of the independent variables, we do not correct for the non-constant variance. Further research must be conducted to investigate whether models with robust standard errors are more appropriate. More detailed graphs with predictions obtained by the three fixed-effects generalized linear models can be found in Appendix A.10.

Figure 10: Residuals against the predictions obtained by the four different models
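The heteroscedasticity check in Figure 10 can be reproduced with a simple residual-versus-prediction scatter; the sketch below assumes arrays of actual and predicted daily sales and uses matplotlib, which may differ from the plotting tools used in the thesis.

```python
import matplotlib.pyplot as plt

def residual_plot(actual, pred, title):
    """Plot residuals (actual - predicted) against the predictions to inspect
    whether the error variance grows with the predicted sales volume."""
    resid = actual - pred
    plt.scatter(pred, resid, s=8, alpha=0.5)
    plt.axhline(0.0, color='red', linewidth=1)   # zero-residual reference line
    plt.xlabel('Predicted quantity of daily sales')
    plt.ylabel('Residual')
    plt.title(title)
    plt.show()
```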

5.3 Results per bin

Based on the accuracy measures and correlation plots examined in the previous section, model performance clearly differs per bin. Therefore, we review the performance results per bin in more detail. Figure 11 shows the actual quantity of daily sales per bin and the predictions obtained by the four different models. Note that the y-axis values are hidden for confidentiality reasons and can be provided upon request. The high peak observable on the 16th of September is caused by a daily deal for Robijn detergents. Further, the 10-day bulk period covers the dates from the 23rd of September to the 2nd of October, in which a period of high sales volume is observed.

The graph corresponding to the total quantity of daily sales of bin 1 shows that the NBmin model outperforms the other models. There is a slight difference with the predictions obtained by the Poismin model, although the negative binomial is somewhat more accurate in predicting high-volume demand, which is consistent with prior findings. As can be seen in the graph corresponding to bin 2, the sums of the predictions obtained by the three penalized fixed-effects models are approximately equal to each other. Only during the daily deal on the 16th of September do the predictions obtained by Gaus1se noticeably deviate from the actual values. In the graph

corresponding to bin 3, the NBmin and the Poismin models obtain similar results during the bulk period, and neither model covers the peak during the daily deal. The Gaus1se model over-predicts the sales volume during the bulk period and underestimates it outside bulk periods. Therefore, in line with the findings mentioned in the previous section, the NBmin and the Poismin models are both preferred over the Gaus1se model for products in bin 2 and bin 3. Considering the results for bin 4, the NBmin and the Poismin models cannot capture the sales pattern and both under-predict the sales volume. The Gaus1se model can reasonably predict the total sales volume during the bulk period; outside the bulk period, the model tends to under-forecast the quantity of daily sales for products captured in bin 4.

From the results per bin, we can conclude that the models based on count distributions are more beneficial for all products except those with very low-volume demand. The penalized fixed-effects negative binomial model obtains a slightly higher predictive performance than the Poisson form. Therefore, the assumption that sales follow a negative binomial distribution is slightly preferred over the Poisson distribution. The advantage of using the negative binomial over the Poisson distribution is not of the same magnitude as described in Snyder et al. (2012). Furthermore, it seems that for products with very low-volume demand, captured in bin 4, the Gaus1se model performs better.

Figure 11: Actual values and predictions of daily sales volume per bin

5.4 Performance of top 6
