
The performance of machine learning algorithms in a FMCG setting


Academic year: 2021


The performance of machine learning algorithms in a

FMCG setting

Measuring online and offline promotional effectiveness

Master Thesis

J.J. Westhoeve

University of Groningen

Faculty Economics and Business

MSc. Marketing Intelligence and Management

January 15, 2017

Student number: s2204347

Reitemakersrijge 6-34

9711 HT Groningen

+31683166217

j.j.westhoeve@student.rug.nl

First supervisor: prof. dr. J.E. Wieringa


Management Summary

Firms can nowadays reach their customers through a great variety of marketing instruments. Understanding the contribution of different online and offline advertising instruments and the effectiveness of marketing channels has never been more important than it is now. The aim of this study is two-fold. First, it assesses the online and offline promotional effectiveness and possible cross-media synergies for a low-involvement product in a FMCG setting. Increased knowledge about the contribution of different advertising instruments and marketing channels reduces inefficiencies when optimizing the media mix in the future. Second, a comparative analysis of different predictive modeling techniques is performed to see which supervised learning method performs best in this particular setting. This analysis includes linear regression, negative binomial regression, neural networks, support vector machines, decision trees, bagging, boosting and random forests. These methods allow for a comparison between generalized linear models, machine learning techniques and ensemble learning.

Panel data is obtained to investigate the contribution of different advertising instruments. The following instruments are included: YouTube, online display and television. The data describes the soft drink buying behavior of households in the Netherlands over a three-month period. The data, which consists of an online and an offline panel, also describes the media consumption behavior of households. For the analysis, three models are created: the online panel model, the offline panel model and the both panels model. The supervised learning methods are applied to the three corresponding subsets of data.

This research investigates the purchase behavior of households by looking at their weekly number of purchases. Price and promotion turn out to be the key influencers in all three models. No cross-media synergies are found. Only in the online panel model do YouTube advertisements positively influence the weekly number of purchases of households. No other advertising instruments are found to be significant.

To compare the different predictive modelling techniques, two performance measures are used: the root mean square error (RMSE) and the mean absolute error (MAE). This study shows that advanced techniques are more accurate than generalized linear models. Ensemble methods in particular perform well in combination with decision trees. Random forests are the most accurate ensemble method for predicting the purchase behavior of a low-involvement product in a FMCG setting, followed by boosting and bagging. Shifting from regression to the best performing ensemble method improves performance by at least 32% and up to 83%. Therefore, it is recommended to use ensemble methods to address similar marketing problems in the future.
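The two performance measures named above can be illustrated with a minimal Python sketch. The function names and the example values are assumptions made for illustration only; they are not taken from the thesis data or its scripts.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: large errors weigh more heavily than small ones."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(actual, predicted):
    """Mean absolute error: every unit of error weighs the same."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical weekly purchase counts for five household-weeks (not thesis data).
actual = [0, 2, 1, 0, 4]
predicted = [0.5, 1.5, 1.0, 0.0, 2.0]

print("RMSE:", rmse(actual, predicted))
print("MAE:", mae(actual, predicted))
```

In this example the single large miss (4 observed, 2 predicted) raises the RMSE more than the MAE, which is why the two measures are usually reported side by side.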

Keywords: Generalized Linear Models, Machine Learning Algorithms, Ensemble Learning, Predictive


Preface

During my bachelor’s degree in Business Administration, I discovered my enthusiasm for marketing and statistics. My choice for the Marketing Intelligence and Management master’s programme was therefore an easy one. I have no regrets about this choice, and the intelligence courses in particular were truly interesting and challenging. I believe that you need to go the extra mile to improve your capabilities.

I was really inspired by the new Data Science and Marketing Analytics course this year, and by the approach that was used. Big data and machine learning algorithms fascinated me, and they are a really challenging way to address marketing problems in the digital millennium we are currently living in. The data science course triggered me to use machine learning in my thesis and provided a solid theoretical basis.

A special thanks to prof. dr. Jaap Wieringa for his support during the process of writing my master’s thesis, and the design of the statistical scripts. Furthermore, many thanks to my fellow students, family and girlfriend who provided me with feedback, help and support during my thesis.


Table of Contents

Management Summary
Preface
1 Introduction
2 Theoretical Framework
2.1 The customer journey
2.2 Online advertisement
2.3 Offline advertisement
2.4 Cross-media synergies
2.5 Price and promotions
2.6 Control variables
2.7 Conceptual model
3 Data
3.1 Data description
3.2 Outliers
3.3 Missing values and imputation
3.4 Variable descriptions
4 Methodology
4.1 Models
4.2 Sampling
4.3 Generalized linear models
4.4 Machine learning techniques
4.5 Ensemble learning
4.6 Performance measures
4.7 Software package
5 Results
5.1 Negative binomial regression
5.2 Multicollinearity
5.3 Machine learning techniques
5.4 Ensemble methods
5.5 Performance measures
6 Discussion
7 Conclusion and Recommendations
8 Limitations and Future Research


1 Introduction

Allocating the right budget to the right marketing instruments has gained importance in the marketing literature in recent years. The innovativeness of marketing departments is crucial for them to stay relevant and influential (Verhoef & Leeflang, 2009). Firms increase their technical marketing skills to make optimal use of the exploding number of online and mobile media channels (Batra & Keller, 2016). Increasing the number of touchpoints in the customer journey is of paramount importance, and more advanced forms of advertising such as sponsored search, affiliates, online displays and social media are complementing the media mix (Anderl et al., 2014; Lin et al., 2013). Due to this phenomenon the Marketing Science Institute (MSI) made assessing spillover effects and attribution modeling of online and offline channels their number 1 research priority for the coming two years (MSI, 2016).

Potential customers are confronted with many new forms of advertising. Across all of their touchpoints the question arises how big the contribution of each online and offline advertising instrument is and should be. According to de Haan et al. (2014) integration of online and offline advertising instruments can yield up to a 21% revenue increase over the status quo when it is properly arranged. Despite the rise of the internet, many advertisers are still very restrained in shifting their budgets from traditional TV advertising to the internet (Draganska et al., 2014). Recently, Google published a meta-analysis of 56 case studies, stressing the importance of online video advertisement on YouTube over traditional TV advertisements. According to Google, up to 80 percent of the YouTube ads were far more effective than the TV ads in driving sales (Sweney, 2016). Google claims that shifting budgets to YouTube ads would give companies a better return on investment.

According to Wildner & Modenbach (2015) TV still pays off in a digital world and according to Sayedi et al. (2014), TV advertising still has a big market share. However, a shift to online advertising is recognized and the world of linear television is changing. Media consumption has changed drastically due to the rise of interactive television and binge watching (Schweidel & Moe, 2016). By using interactive television, customers are able to skip ads and consume more content in less time. Binge watching is a relatively new phenomenon where individuals watch multiple episodes without stopping. Paid online video services such as Netflix are gaining popularity since customers do not have to deal with any form of advertisements. In 2015, the number of people watching linear TV in the US decreased by 12 percent due to online video services (Hern, 2015).

This paper investigates the importance of online and offline marketing instruments for a low-involvement product: soft drinks in a fast moving consumer goods (FMCG) setting. Panel data observing households is obtained to examine possible cross-media effects of different online and offline instruments: YouTube, online display and television. To assess the importance of the different marketing instruments a comparative analysis of predictive modelling techniques is performed.

All methods can be labeled as supervised learning techniques where: “the algorithm is trained


linear regressions are widely applied in the marketing literature and assume a normally distributed error term. This works well for continuous dependent variables. In this research counts are modeled, for which a Poisson distribution is better suited (Leeflang et al., 2015; Witten & Frank, 2005). Since the Poisson regression assumptions are violated, a more flexible extension is used: negative binomial regression (Blattberg et al., 2008).
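The text does not state which Poisson assumption is violated. A common reason for moving to negative binomial regression is overdispersion, where the variance of the counts clearly exceeds their mean. Assuming that is the issue here, the following minimal Python sketch shows a simple descriptive check. The function name and the example counts are hypothetical and are not taken from the thesis.

```python
import statistics


def dispersion_ratio(counts):
    """Sample variance divided by sample mean.

    Under a Poisson model the variance equals the mean, so the ratio is
    expected to be close to 1. Values well above 1 suggest overdispersion.
    """
    mean = statistics.mean(counts)
    if mean == 0:
        raise ValueError("mean is zero, so the ratio is undefined")
    return statistics.variance(counts) / mean


# Hypothetical weekly purchase counts: many zeros and a few heavy buyers.
weekly_purchases = [0, 0, 0, 0, 1, 0, 2, 0, 0, 7, 0, 1, 0, 0, 9]
ratio = dispersion_ratio(weekly_purchases)

print("variance / mean:", ratio)
if ratio > 1:
    print("Counts look overdispersed relative to a Poisson model.")
```

This is only a descriptive check on raw counts. A formal assessment would look at the dispersion conditional on the explanatory variables in the fitted model.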

By introducing machine learning algorithms such as Artificial Neural Networks (ANN) and Support Vector Machines (SVM), more sophisticated, less restrictive and less formal statistical learning methods are included that have great potential for prediction and classification (Blattberg et al., 2008; Cui & Curry, 2005; James et al., 2013; Kübler et al., 2016). Furthermore, decision trees are applied. A big advantage of decision trees is their simplicity and flexibility (Elith et al., 2008). Since single trees are not expected to compete with the more sophisticated statistical methods in this paper, they are applied together with ensemble learning techniques, which improve the prediction accuracy by combining multiple trees (James et al., 2013). Bagging, boosting and random forests combined with decision trees are used to accomplish a complete comparison between traditional regression methods, machine learning techniques and ensemble learning. Bagging and boosting often result in better classifications than a logistic regression (Lemmens & Croux, 2006) and the predictive power of random forests is competitive with bagging and boosting (Breiman, 2001).
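The idea of combining multiple trees can be sketched for bagging, the simplest of the three ensemble methods mentioned above. The Python sketch below is an illustration under strong assumptions: each "tree" is a one-split regression stump on a single predictor, and the price and purchase values are hypothetical. It is not the implementation used in the thesis, and it does not cover boosting or random forests.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimising the squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue  # a split must leave observations on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All x values are identical: predict the overall mean everywhere.
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged(xs, ys, n_trees=25, seed=0):
    """Bagging: fit one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_bagged(stumps, x):
    """Average the predictions of all stumps in the ensemble."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)


# Hypothetical data: price per unit and weekly number of purchases.
prices = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
purchases = [4, 3, 3, 1, 1, 0]

stumps = fit_bagged(prices, purchases)
print("predicted purchases at price 1.0:", predict_bagged(stumps, 1.0))
print("predicted purchases at price 2.0:", predict_bagged(stumps, 2.0))
```

A single stump can only predict two distinct values. Averaging stumps fitted on different bootstrap samples gives a smoother prediction, which is the mechanism through which bagging reduces the variance of individual trees.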

The purpose of this paper is two-fold. First, the importance of different instruments is assessed. Therefore, the following research question is formulated: What is the importance of online and offline marketing instruments on the purchase behavior of a low-involvement product in a FMCG setting? Second, to answer this question multiple modelling techniques are used. This results in the following research question: Which supervised learning method performs best in predicting the purchase behavior of a low-involvement product in a FMCG setting?

This paper contributes to the marketing literature in multiple ways. First, previous researchers examined online and offline instruments thoroughly, mainly by using linear modelling techniques such as generalized linear models. Most recently Srinivasan et al. (2016) explored cross-media effects by using a Vector Autoregressive (VAR) model to determine the influence of online consumer activities and of traditional marketing actions in a FMCG setting. By comparing efficient and powerful machine learning techniques, this paper applies a new approach to studying the complex, potentially synergetic relationships between online and offline marketing instruments in a FMCG setting. Second, advertisers can reduce inefficiencies in the future by optimizing their media mix decisions between online and offline marketing instruments. Furthermore, by looking at the number of weekly purchases of households, a different way of predicting purchase behavior is used.


2 Theoretical Framework

2.1 The customer journey

The attribution of marketing instruments and the effectiveness of marketing channels have been a point of interest for researchers for some years now. To obtain optimal results within the company, the capability to determine the effectiveness of individual marketing channels is crucial (Anderl et al., 2016). Due to the massive increase in instruments, a multichannel environment has been created in which the interplay of online and offline instruments has never been more important than it is now (Anderl et al., 2016; Kannan et al., 2016). In this new environment customers come into contact with far more purchase and communication options than before. Therefore, multichannel customer management is essential in optimizing a marketing strategy (Ansari et al., 2008).

According to Van der Veen & Van Ossenbruggen (2015) three benefits can be achieved by performing a multichannel strategy. First, it is more cost efficient than performing a single channel strategy. Secondly, optimizing the distribution network leads to many more prospective customers. Thirdly, the needs of the customer are optimally accommodated. However, using various communication and advertising channels and allocating the right amount of budget to each of them is challenging and very complex (Fischer et al., 2011).

According to Srinivasan et al. (2016) the linear consumer process as we know it (a funnel) is obsolete and no longer relevant. Consumers follow a so-called path to purchase (P2P) on their way to conversion (Batra & Keller, 2016). A funnel is replaced by a more complex network structure, replacing the linear and straight line purchase process. This new structure is often called a customer journey in the marketing literature, describing all the touchpoints customers face with a company in their process towards a purchase (Anderl et al., 2016; Kireyev et al., 2016; Srinivasan et al., 2016).


computer science. According to Kannan et al. (2016) the actual implementation and effective attribution of marketing instruments nowadays depends on the effective implementation of big data and real-time analytics. Big data can be defined as: “massive data sets having large, more varied and complex structure with the difficulties of storing, analyzing and visualizing for further processes or results” (Sagiroglu & Sinanc, 2013). By using big data, hidden patterns and deeper insights can be gained to increase advantage over competitors (Sagiroglu & Sinanc, 2013). Appropriate analyses create great opportunities in predicting the likelihood of attracting prospective and current customers (Kübler et al., 2016). By making use of big data sources from for example social networks, companies are better able to identify their customers and optimize their path to purchase.

It can be concluded that online advertising is gaining importance compared to offline advertising but that both are needed for effectively reaching the customer during the customer journey. According to Srinivasan et al. (2016) online consumer activity is still affected by the traditional offline activities in the customer journey. They conclude that although the influence of online instruments is greater than for example TV advertising, both are needed to drive effective sales for FMCGs.

2.2 Online advertisement

Consumers are spending much more time on the internet nowadays and therefore consume much more media content online. Companies are increasingly relying on online advertising, which has an increasing share of the total advertising market (Goldfarb, 2014). Due to the complexity of the many online channels, measuring the contribution of each online instrument is demanding (Anderl et al., 2016). A wide variety of marketing channels is being deployed to reach the desired customers.

According to Srinivasan et al. (2016) the contribution of online instruments can be separated into three types of media: owned, earned and paid media. Owned media are channels that firms own and control, such as a firm’s website that customers visit. Earned media are channels that are less controlled. Different parties are used to reach the customers and to communicate with them. Earned media can result in positive feedback that companies receive via, for example, social media channels. Paid media are the channels where firms pay for a variety of advertising media.

In many ways online advertisement is different from traditional media. According to Stolyarova & Rialp (2014), it is more efficient than offline instruments for a number of reasons. First of all, it is more cost efficient than traditional media. A big advantage is the increase in targeting possibilities, which helps to reach the right audience as cost-efficiently as possible. Online activities also reach the customer faster and more often than traditional instruments. Furthermore, online advertisement is very interactive. Customers can interact with the firm and they control when and how to react. Lastly, internet marketing differs from traditional marketing in that its effects are easier to measure.


cost per acquisition (CPA), conversion rates and impression tools (Braun & Moe, 2013; Kireyev et al., 2016). These capabilities create great opportunities for firms, for example to monitor conversations of customers on social media. Measuring attitudes towards their brand and engagement in this way is far more efficient than using classical surveys (Srinivasan et al., 2016). For advertisers these online metrics are more useful in justifying their decisions and expenditures on advertisements than the metrics available for traditional media such as television and radio (Kireyev et al., 2016).

According to Goldfarb (2014), online advertising can be divided into three general categories: search advertising, classified advertising and display advertising. These categories fit best in the paid media category of Srinivasan et al. (2016).

Search advertising consists mostly of advertising on search engines such as Google and Bing. Auction mechanisms determine the price of an ad. This price is paid only when someone clicks on the ad and is called the cost-per-click (CPC) (Goldfarb, 2014). Search advertising is a challenging task for firms since it offers a lot of targeting possibilities. Customers that reveal their interest by searching on specific company related key terms can be targeted. Advertisers are able to show their ads at the exact moment prospective customers are searching for something on, for example, Google (Goldfarb, 2014; Sayedi et al., 2014). Furthermore, firms are able to target customers of competitors by advertising on the keywords of competing firms. This phenomenon is called poaching and makes use of the market share of competitors (Sayedi et al., 2014).

Classified advertising is the type of advertising that appears on websites where no other sorts of media content are displayed. This category consists foremost of online job sites and dating sites (Goldfarb, 2014). One of the most important and most widely used online instruments is display advertising. Besides search advertising, display advertising is the main revenue generator for firms (Goldfarb, 2014). This includes typical social media ads, media-rich ads on websites, banner ads and plain text ads. In a FMCG setting Srinivasan et al. (2016) found a significant relationship for online advertisements, which account for 15% of the sales variation. The study of Danaher & Dagger (2013) revealed that search advertising is most influential in driving the purchase behavior of customers.

A shift is recognized towards many effective forms of online advertising since online media consumption is still increasing. The different online instruments are categorized in multiple ways in the literature. Based on the categorization and the findings related to search advertising, classified advertising and display advertising, the following hypotheses are formulated:

H1a: Search advertising will positively influence the purchase of a low-involvement product in a FMCG setting.

H1b: Classified advertising will positively influence the purchase of a low-involvement product in a FMCG setting.


H1d: Search advertising will have a greater positive influence than classified and online display advertisements on the purchase of a low-involvement product in a FMCG setting.

Since advertisers are able to control online instruments and media types more closely, a shift can be recognized towards online targeted media options (Li & Lo, 2015). A possible change is the shift from television advertising towards online video advertisements. Firms like the fact that online video advertisements are very cost efficient compared to television advertisements. Also the extensive reach of online video advertisements on for example YouTube is a big advantage for firms (Shehu et al., 2016). YouTube is one of the biggest social media platforms with over one billion active users (YouTube, 2016). Researchers of Google Inc. claim that YouTube is a very effective way of advertising (Goerg et al., 2015; Wattenhofer et al., 2012; Jin et al., 2013). Due to the extensive reach of online video advertisements and the cost-efficiency as argued by the literature, the following hypotheses are formulated:

H1e: Online video advertisements will positively influence the purchase of a low-involvement product in a FMCG setting.

H1f: Online video advertisements will have a greater positive influence than TV advertisements on the purchase of a low-involvement product in a FMCG setting.

Due to the increased communication options also for low-involvement products, it is very likely that consumers will learn about the product before buying it. In their path to purchase a combination of online media, TV, social connections or other stimuli is very likely (Srinivasan et al., 2016). In general, the internet does not have a strong influence on the creation of brand value. The influence of traditional media such as TV is much stronger as is discussed in the next paragraph (Stolyarova & Rialp, 2014).

2.3 Offline advertisement

Traditionally, advertisements were initiated by firms to push a message towards the customers, the so-called firm-initiated contacts (FICs) (Wiesel et al., 2011). Offline advertising can be classified into two types: mass media and individually targeted media (Naik & Peters, 2009). Mass media includes TV, radio and print such as newspapers and magazines. Of the traditional instruments, only direct mail is individually targeted. Where online advertisement foremost focuses on the immediate measurement of performance, offline advertisement mostly focuses on accomplishing long-term effects and building a brand (Braun & Moe, 2013).


the fact that TV is most effective in building a brand. They found that the long-term ROI of TV advertising is a factor of 2.65 for all investigated brands. This is also in line with the research of Stolyarova & Rialp (2014). They claim that generally speaking, TV is still the most efficient offline medium, followed by radio. Print appeared to be the least efficient. It also supports an earlier comparison between TV and other media types, including internet advertising. According to Dijkstra et al. (2005) TV advertisements are able to use a lot of stimuli such as visuals together with audio to provoke cognitive responses. To measure cognitive responses, measures like brand and ad recall are often used to see whether customers are able to recall the ad or brand after seeing it. TV is superior in building a brand since TV advertisements achieve the highest brand recall after exposure (Draganska et al., 2014). Also, the links of the message are best processed in a TV ad. Consequently, the brand building strengths of TV ads are much greater than those of online advertisements in any form. According to Srinivasan et al. (2016) TV advertisements have a significant influence on the sales of FMCGs and account for 5% of the sales variation. Due to the great long-term effects and the brand building characteristics the following hypotheses are formulated:

H2a: Mass media advertisements (TV, radio and print) will positively influence the purchase of a low-involvement product in a FMCG setting.

H2b: Individually targeted media advertisements will positively influence the purchase of a low-involvement product in a FMCG setting.

H2c: TV advertisements will have a greater positive influence than other traditional advertisements on the purchase of a low-involvement product in a FMCG setting.

2.4 Cross-media synergies

Since advertising is so important for many firms, employing their media mix as efficiently as possible across many different channels is essential. Customers may visit multiple channels before they purchase a product at the end of the funnel (Anderl et al., 2016; Li & Kannan, 2014). Different channels are interdependent and lean on one another (Van der Veen & Van Ossenbruggen, 2015). The possible complementary effects between channels can be divided into carryover effects and spillover effects. According to Anderl et al. (2016), in the case of carryover effects the channel used by customers affects the purchase decision in that same channel. In the case of spillover effects, different channels affect the purchase decision of a potential customer.

Coordination between different channels is important when performing a multichannel strategy. The effect of exposure to different media is called a synergy effect (Chang & Thorson, 2004). According to Naik & Raman (2003) synergy can be defined as follows: “the combined effect of multiple activities exceeds the sum of their individual effects”. To build synergy, it is essential


Valkenburg, 2014). Cross-media campaigns are created to “maximize the effectiveness of their budgets by exploiting the unique strengths of each medium” (Voorveld et al., 2011). For firms a combination of different instruments is essential to effectively communicate a message to potential customers. Compared to a single instrument strategy, it can create extra effects and improve the performance of the media mix if well performed (Voorveld & Valkenburg, 2014). According to Danaher & Dagger (2013) the strategy of using multiple media forms is better than a simple media strategy. They claim that cross-media synergies exist and if online activities are performed alongside traditional instruments, it can significantly influence purchases.

According to Dinner et al. (2014) the cross-media synergies are very large. The effect of online advertising on offline sales is especially interesting. Online advertising is more than a tool and can effectively grow offline sales. This is supported in a FMCG setting, where products are mostly purchased offline. According to Srinivasan et al. (2016) online metrics do significantly interact with each other but are also significantly affected by traditional communication channels. Together, this results in higher sales than without the interaction of both channels. This also holds for more specific relations between online and offline media, such as television and the web in general. According to Srinivasan et al. (2016) a predominant flow exists between TV and online instruments, moving eventually towards a purchase. This confirms the gut feeling of advertisers to keep spending budget on TV advertisements and is in line with the findings of Chang & Thorson (2004), who found that TV and web synergy can lead to higher attention and higher perceived message credibility. More specifically, Kumar et al. (2013) found a significant relationship between social media and offline sales. As a consequence, the following hypotheses are formulated:

H3a: Combining online and offline instruments will positively influence the purchase of a low-involvement product in a FMCG setting.

H3b: Combining online instruments will positively influence the purchase of a low-involvement product in a FMCG setting.

H3c: Combining offline instruments will positively influence the purchase of a low-involvement product in a FMCG setting.
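The synergy definition of Naik & Raman (2003) quoted earlier in this section, where the combined effect exceeds the sum of the individual effects, is often represented by an interaction term between two instruments. The Python sketch below illustrates this idea only. The functional form, the function name and all coefficient values are hypothetical assumptions and do not reflect the model specification or estimates of this thesis.

```python
def sales_response(tv, online, b_tv=2.0, b_online=3.0, kappa=0.5):
    """Hypothetical linear response with an interaction (synergy) term.

    kappa > 0 means that using both instruments together yields more than
    the sum of using each instrument on its own.
    """
    return b_tv * tv + b_online * online + kappa * tv * online


tv_only = sales_response(tv=1, online=0)
online_only = sales_response(tv=0, online=1)
combined = sales_response(tv=1, online=1)

# Synergy is the part of the combined effect not explained by the sum
# of the individual effects. Here it equals kappa.
synergy = combined - (tv_only + online_only)
print("combined effect:", combined)
print("sum of individual effects:", tv_only + online_only)
print("synergy:", synergy)
```

In a regression setting, a hypothesis such as H3a would correspond to testing whether the coefficient of such an interaction term is positive and significant.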


2.5 Price and promotions

As argued, the purchase behavior can be influenced by many different advertising instruments. In addition to the significant effect of online and offline advertising, price remains an important influencer of a purchase (Singh, 2015). This is in line with the research of Srinivasan et al. (2016) who claim that although online and offline advertisement contribute significantly to the sales, still 20% of the sales variance can be related to pricing. This is expected since FMCGs are, as noted earlier, products with a low-involvement nature.

How much time and effort a customer spends on the decision to buy a product distinguishes high-involvement products from low-involvement products (Nagar, 2015). In high-involvement situations consumers often show great interest in information search and attribute comparison and have a strong preference for certain brands, whereas in low-involvement situations consumers are less motivated to find out more about a brand or a product (Nagar, 2015). The difference between both can be explained by the Elaboration Likelihood Model (ELM) introduced by Petty et al. (1983). This model describes how attitudes are formed, resulting in two routes towards persuasion. First, high-involvement products are often processed via the central route. Via this route, customers are able to assess the true merits of a product and carefully consider the complete product. Second, this is not the case for low-involvement products. Customers mainly evaluate these products based on positive and negative cues. This is called the peripheral route, in which only simple inferences determine the attitude towards a product. Wakefield & Inman (2003) argue that greater product involvement leads to higher loyalty and lower price sensitivity, indicating that for FMCGs consumers are less loyal and more price sensitive. They defined price sensitivity as: “the extent to which individuals perceive and respond to changes or differences in prices for products or services” (Wakefield & Inman, 2003). Highly price sensitive customers are well aware of their preferences and make a purchase more frequently compared to less price sensitive customers (Kim & Rossi, 1994).


Because FMCGs are often products with a low-involvement nature, the incorporation of the promotion and price variables is inevitable. Therefore, the following hypotheses are formulated:

H4a: A lower price will positively influence the purchase of a low-involvement product in a FMCG setting.

H4b: Promotions will positively influence the purchase of a low-involvement product in a FMCG setting.

H5a: A lower price combined with promotions will positively influence the purchase of a low-involvement product in a FMCG setting.

H5b: A lower price combined with advertising will positively influence the purchase of a low-involvement product in a FMCG setting.

H5c: Promotions combined with advertising will positively influence the purchase of a low-involvement product in a FMCG setting.

2.6 Control variables

Besides the fact that the purchase behavior of consumers is influenced by advertising stimuli, price and promotions, other factors are likely to influence the purchase behavior of households in a FMCG setting. Socio-demographic factors certainly influence the consumption behavior of low-involvement products and therefore, it will influence the buying behavior of households. In case of socio-demographics, the influence may differ across different product types and is therefore of great importance. Furthermore, it is argued for FMCGs that a highly competitive environment exists. Therefore, the buying behavior of competitor products must be taken into account, since it may influence the purchase behavior of the investigated brand.

Since the topic of this paper can be narrowed down to the subject of soft drinks, several socio-demographical variables must be taken into account since they may have influenced the purchase and consumption intention of households. An earlier study by Monsivais & Drewnowski (2009) controls for four socio-demographical variables: age, income, education and household size.

Age is a very important variable to control for, since the influence of age is likely to differ between product categories. Soft drinks are top of mind for youngsters according to Subhasis & Sanjukta (2009). They are likely to buy and consume these products in their social environment every day. Especially in institutions with canteens, such as colleges and schools, youngsters continuously face the temptation of soft drinks. The case study by Subhasis & Sanjukta (2009) shows that younger people (16-24 years) consume considerably more soft drinks than average. This is supported by later research of Han & Powell (2013), who claim that adolescents and young adults are the heaviest consumers of soft drinks.


and also higher educated people are likely to consume less sugar sweetened beverages. According to Han & Powell (2013) a lower education results in higher odds of soft drink consumption. These findings are also in line with research in the fast-food area where the consumption of fast-food diminishes at the highest levels of education (Paeratakul et al., 2003).

In case of income, households with a higher income are likely to spend more money on their diet and less on sugar sweetened beverages (Monsivais & Drewnowski, 2009). Therefore, their diet quality increases compared to lower income households. Han & Powell (2013) show that sugar-sweetened beverages are more often consumed by low income customers compared to customers with a high income.

Lastly, the model controls for household size. Household size is likely to influence the rate of consumption, since bigger households are likely to consume more (Monsivais & Drewnowski, 2009). This is confirmed by Paeratakul et al. (2003), who find that bigger households consume more fast-food than smaller households.

Based on current findings related to age, income, education and household size in a similar setting, the following hypotheses are formulated to control for the effects of socio-demographical variables:

H6a: A lower age will positively influence the purchase of a low-involvement product in a FMCG setting.

H6b: A lower household income will positively influence the purchase of a low-involvement product in a FMCG setting.

H6c: A lower educational level will positively influence the purchase of a low-involvement product in a FMCG setting.

H6d: A larger household size will positively influence the purchase of a low-involvement product in a FMCG setting.

Furthermore, a great number of competitors exists in a FMCG setting. The pressure of competitors and the rivalry must be taken into account in great detail (Kitchen, 1989). Because consumers are often less loyal in a FMCG setting, it is reasonable to assume that switching to competitor products will decrease the purchase intention for the brand of interest. Therefore, the following hypothesis is formulated:

H7: Competitor purchases will negatively influence the purchase of a low-involvement product in a FMCG setting.

2.7 Conceptual model

All the hypotheses that have been formed are visually represented by a conceptual model (Figure 1). This model encompasses the advertising instruments, price, promotion and the control variables that are hypothesized related to the purchase of a low-involvement product in a FMCG setting.

FIGURE 1: Conceptual model

3 Data

3.1 Data description

To explore the differences between the online and offline instruments and the possible synergetic effects, data containing different consumer panels is acquired. The data describes the purchase behavior of 10.703 households in the Netherlands over a three-month period. The measurements run from the 31st of December 2013 until the 29th of March 2014. During this period the purchase behavior for a soft drink brand is measured on a daily level. This is the so-called product consumption panel, which includes all participating households. During the measurement period 12.118 purchases are made, covering 48.753 soft drink units in total. On average, four units are bought per purchase. More detailed information on the purchases can be found in Appendix A1.

Second, socio-demographics and information on the participation in different media consumption panels are acquired. For measuring online and offline media consumption, two panels can be distinguished. First, the offline panel, in which 13,5% of the households participated in a passive TV panel. Second, the online panel, in which 87,6% of the households participated in a panel regarding online display and YouTube ads. Not all households participated in one of the panels: 76,8% participated in exactly one of the two panels and 12,2% participated in both (Table 1). Furthermore, different aspects of household composition are included in the data, such as age, income, size of household and education level.

TABLE 1: Panel participation of households

| Panel | Percentage | Number of households |
|---|---|---|
| Offline panel | 13,5% | 1443 |
| Online panel | 87,6% | 9380 |
| Both | 12,2% | 1304 |
| Either one | 76,8% | 8215 |

The data is measured on a daily basis, meaning that the level of aggregation can be arranged in multiple ways. For further analyses a weekly aggregation level is chosen, because most households buy groceries only a limited number of times per week. A weekly aggregation level is also preferred for interpretational purposes and statistical power. This aggregation level still results in a fairly large number of records, which is required for a realistic comparison between the sophisticated machine learning techniques. Compared to classical statistical modelling techniques, machine learning techniques are better able to process larger amounts of complex data (Kübler et al., 2016).
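The weekly aggregation described above can be sketched as follows. This is a minimal illustration in Python (the analyses in this research are performed in R), and it assumes that weeks are counted in blocks of seven days from the first measurement day; the actual week definition, the record layout and the household identifiers are hypothetical.

```python
from collections import Counter
from datetime import date

# First measurement day as reported in the data description.
PANEL_START = date(2013, 12, 31)


def week_index(day: date) -> int:
    """Return the 0-based week number of `day`, counted from PANEL_START.

    Assumption: a week is a block of seven consecutive days starting at
    PANEL_START, not a calendar week.
    """
    return (day - PANEL_START).days // 7


def weekly_purchases(records):
    """Count purchases per (household, week) from (household_id, day) pairs."""
    counts = Counter()
    for household_id, day in records:
        counts[(household_id, week_index(day))] += 1
    return counts


# Hypothetical daily purchase records: (household_id, purchase date).
records = [
    ("hh1", date(2013, 12, 31)),
    ("hh1", date(2014, 1, 3)),
    ("hh1", date(2014, 1, 7)),
    ("hh2", date(2014, 3, 29)),
]
weekly = weekly_purchases(records)
```

Household-weeks without a purchase do not appear in the counter; in a modelling dataset they would have to be added explicitly with a count of zero.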

3.2 Outliers

An important aspect in preparing the data for analysis is the detection of outliers and flaws. In the product consumption data some possible outliers are detected. For instance, one household twice bought 100 units of the soft drink. However, these purchases were not made in the same month, and since some households may simply buy more than others, these observations are not removed from the data. Furthermore, only one household twice bought a soft drink for 3 euros or more. However, some FMCG stores are higher priced and therefore this observation is not excluded from the analyses either. No further outliers are detected in the product consumption data.

In case of the socio-demographics some flaws are detected. For 13 households, characteristics were missing. This problem is solved by using multiple imputation. This is described and motivated in detail in the next paragraph.

3.3 Missing values and imputation

Since the purpose of this analysis is to describe the data as well as possible, handling missing values is essential to prevent inefficient analyses and biased estimates (Donders et al., 2006). Missing data results in a partial loss of information, and therefore imputation methods are needed to prevent a potential loss of statistical power (Schafer & Graham, 2002). As mentioned, socio-demographics were missing for 13 households. A table with the specific cases is added in Appendix B1. To prevent the sample size from shrinking by 1.170 records, imputation is of great importance.

Imputation can be defined as “replacing that missing by a value that is drawn from an estimate of the distribution of this variable” (Donders et al., 2006). According to the literature, sophisticated imputation methods are more powerful than simple techniques such as mean imputation and listwise or pairwise deletion. Multiple imputation uses multiple datasets to derive the estimates from and is based on a joint normality assumption (Blattberg et al., 2008; Donders et al., 2006; Schafer & Graham, 2002).

Multiple imputation helps to obtain correct estimates and p-values. Donders et al. (2006) state that multiple imputation only leads to unbiased results when data is Missing Completely at Random (MCAR) or Missing at Random (MAR). Under MCAR the probability that a value is missing is unrelated to any characteristic in the data, whereas under MAR this probability may depend on observed characteristics but not on the missing value itself.

Multiple imputation is performed with a widely accepted tool in R: multivariate imputation by chained equations (MICE) as proposed by van Buuren & Groothuis-Oudshoorn (2011). During the imputation, the default setting of five multiple imputations is used. Furthermore, since the missing socio-demographics are all factors with more than two levels, the polyreg method is used. Polyreg is suited for this type of variable and uses a multinomial model to estimate the missing values (van Buuren & Groothuis-Oudshoorn, 2011).

3.4 Variable descriptions

The comparative analysis includes online and offline instruments, the variables price and promotion, and several control variables to see what influences purchase behavior in a FMCG setting. To compare different supervised learning techniques, a count variable is selected as the dependent variable. This variable is the number of purchases and indicates how many purchases a household made in a particular week. Multiple independent variables are included to reveal meaningful patterns in the acquired panel data. Table 2 summarizes all the variables that are used for the analyses.

TABLE 2: Variable names, description and type

| Variable | Variable name | Description | Variable type |
|---|---|---|---|
| Purchases | Purchase_sum | Number of purchases in week t | Continuous |
| YouTube | YT_contacts | Number of contacts with YouTube ad in week t | Continuous |
| TV | TV_contacts | Number of contacts with TV ad in week t | Continuous |
| Online Display | Display_contacts | Number of contacts with Online Display ad in week t | Continuous |
| Price | Price | The average price per unit in week t | Continuous |
| Promotion | Promotion | Number of purchases in promotion in week t | Continuous |
| Competitor Purchases | CompetitorPurchases | Number of competitor purchases in week t | Continuous |
| Age | Age_of_Housewife | Age group of the housewife of the household in week t | Ordinal |
| Income | Household_net_income | Net income group of the household in week t | Ordinal |
| Education | Education_Level | Education level of the household in week t | Ordinal |
| Size Household | HH_size | The size of the household in week t | Ordinal |

Not all formulated hypotheses can be tested with the acquired data, because only a limited number of advertising instruments is available for the analysis: YouTube, online display and television. Most of the hypotheses can be tested with these instruments; an overview is given in Table 3.

TABLE 3: Hypotheses testing

| Hypotheses | Tested | Not tested |
|---|---|---|
| H1 | H1c, H1e, H1f | H1a, H1b, H1d |
| H2 | H2a (only TV) | H2b, H2c |
| H3 | H3a, H3b | H3c |
| H5 | H5a, H5b, H5c | - |
| H6 | H6a, H6b, H6c, H6d | - |
| H7 | H7 | - |

Participation in the online and offline media panels is a requirement for measuring the effect of the particular online and offline instruments in the analysis. The offline panel reflects the number of TV ads seen on a weekly level. During the three-month period, a TV ad is seen 9487 times in total, with a maximum of five ads a day for one household. The online panel reflects the number of YouTube and online display ads seen on a weekly level. In this panel, a YouTube ad is seen 984 times, with a maximum of two times a day for one household. Online display ads are shown 453 times across all households, with a maximum of two display ads a day for one household. Including TV, YouTube and online display advertisements enables a clear comparison between a traditional offline instrument and two more sophisticated online instruments.

The online and offline variables are used to construct cross-media synergy variables. Naik & Raman (2003) describe synergy as the combined effect of multiple activities. First, a variable is created reflecting the joint effect of online and offline instruments. In this case the contacts of TV, YouTube and online display are combined. Subsequently, an online synergy variable is created that combines the YouTube and online display variable. The construction of these variables is key for measuring possible cross-media synergetic effects.

A possible danger of including the combined effect of the individual instruments is multicollinearity. Predictors may capture the same effect, in which case the variables are highly correlated (James et al., 2013). An overview of highly correlated predictors is added as Appendix B2. In this particular case the synergy variables are significantly correlated with the individual variables. Because multicollinearity inflates the variances and covariances of the estimated coefficients, the estimates become unreliable (Leeflang et al., 2015). Therefore, the models are estimated both with and without the different interactions to see how the interactions influence the estimates of the model.

Furthermore, not only advertising instruments may influence purchase behavior. First, price is included in the analysis, indicating the price per unit that a household paid in a certain week. On average the price per unit was €1,31 for a soft drink. A higher price per unit is highly positively correlated with larger units (Appendix B2). This means that if the price is higher, the average size of the unit is also larger. Second, the variable promotion is included. This variable indicates the number of purchases that were made on promotion. In total, 20,9% of the purchases were promoted.

The variable competitor purchases describes the number of purchases of a different brand in a certain week with a maximum of nine competitor purchases. However, most households only bought a competitor product once on a weekly basis.


categories are created for interpretational reasons. In case of education, three education levels are created in line with the Central Bureau for Statistics in the Netherlands (CBS, 2013). The complete transformation of each socio-demographical variable is added as Appendix C1.

4 Methodology

4.1 Models

Not all households in the data participated in one of the media consumption panels. Since this research focuses on the effects of online and offline marketing instruments, the data is split into three subsets. First, a dataset containing households that participated in both panels is created. This dataset is used to test whether both online and offline marketing instruments influence the purchase behavior of households; subsequently the synergy variables are included and tested. Second, a dataset with households that participated only in the online panel is created. This dataset is used to test the effects of online instruments together with the online synergy variable. Lastly, a dataset with households that participated only in the offline panel is created. This dataset is used to test the effects of offline instruments only.

These three subsets result in three models that are used in the comparative analysis of the predictive modelling techniques (Table 4). In each model, the number of purchases is the dependent variable, which is explained by multiple predictors specified for each household in a certain week. For each technique, the same set of predictors is used.

TABLE 4: Model specification per dataset

1 an interaction term with price is included in the model for this variable

2 an interaction term with promotion is included in the model for this variable

4.2 Sampling

A key aspect of supervised learning is the two-step process in which a model is trained and tested. The learning methods are provided with the inputs as well as the actual outcomes (Kübler et al., 2016; Witten & Frank, 2005). A training sample is used for building and training the model, and a testing sample is used to evaluate the actual performance of the predictive model (Bose & Chen, 2009).


In this case several supervised learning methods are performed on the training data and later applied on new unseen data to make predictions (James et al., 2013; Kotsiantis, 2007). Most important is that in case of machine learning: “the goal is to maximize its predictive accuracy on the new data points—not necessarily its accuracy on the training data” (Dietterich, 1995).

Because of the large number of individual records, a large training sample and a smaller testing sample are chosen (Table 5). A commonly used split is 75% of the data for training and 25% for testing (Kübler et al., 2016). The same training and testing samples are used for all learning algorithms to obtain a fair comparison between the models.
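The 75/25 split can be sketched as follows. This is a minimal illustration in Python rather than the R code used in this research; the seed, the function name and the record identifiers are hypothetical. Fixing the seed is one simple way to ensure that every learning algorithm receives exactly the same training and testing samples.

```python
import random


def train_test_split(rows, train_share=0.75, seed=42):
    """Shuffle a copy of `rows` and split it into training and testing parts.

    The fixed seed (an arbitrary, hypothetical choice) makes the split
    reproducible, so all methods can be compared on identical samples.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_share * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical identifiers of household-week records.
rows = list(range(1000))
train, test = train_test_split(rows)
```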

TABLE 5: Training and testing sample size

4.3 Generalized linear models

Generalized linear models are very commonly used in marketing (Kübler et al., 2016). One of the most common supervised learning techniques is a classical linear regression (LR). This method is widely recognized in the marketing literature and can be seen as a gold standard since it performs well across many applications. Therefore, a linear regression is seen as the benchmark method in this research.

A big advantage of classical linear models is that they have an assumed functional form and a set of estimated parameters. Due to the low complexity, the individual effects of different predictors can be easily assessed. By estimating the coefficients, effects are easily quantifiable. Furthermore, a more restrictive model with a predetermined functional form is easier to interpret than more flexible models. Due to the low complexity, linear models usually perform well on new, unseen data (Dreiseitl & Ohno-Machado, 2002).

Due to the fact that this research investigates the number of purchases, a different methodology and distribution is preferred. Since the number of purchases can be seen as a count variable, a Poisson distribution or a negative binomial distribution is most appropriate (Brockett et al., 1996; Ehrenberg, 1959; Witten & Frank, 2005).

In the ideal situation for count data, the mean and the variance of the number of purchases are equal (equidispersion). However, in many applications this assumption does not hold. For every dataset, a Lagrange multiplier test is performed. From this test it becomes evident that all datasets are overdispersed (Table 6), meaning that the conditional variance exceeds the conditional mean of the dependent variable (Månsson, 2012). In case of overdispersion, a negative binomial distribution is more suitable (Blattberg et al., 2008). One of the key reasons that a negative binomial regression (NBR) is more appropriate in case of overdispersion is that it is able to deal with random variation in the mean of the dependent variable (Månsson, 2012).
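A simple descriptive check for overdispersion is to compare the sample variance of the counts with their mean. The sketch below, in Python rather than R, is not the Lagrange multiplier test applied in this research; it only illustrates the idea. The count data is hypothetical, and the last line uses the mean and variance reported for the both panels dataset.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by sample mean.

    A ratio close to 1 is in line with equidispersion (Poisson);
    a ratio clearly above 1 hints at overdispersion.
    """
    return variance(counts) / mean(counts)


# Hypothetical weekly purchase counts: many zeros and a few heavy buyers.
counts = [0] * 90 + [1] * 6 + [2] * 2 + [5] * 2
ratio = dispersion_ratio(counts)

# Reported mean (0,0872) and variance (0,100) of the both panels dataset.
both_panels_ratio = 0.100 / 0.0872
```

Note that this compares the unconditional mean and variance, whereas overdispersion in a regression context concerns the conditional mean and variance given the predictors.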

TABLE 6: Mean and variance for number of purchases

For the negative binomial regression, three parametric models are specified (Table 7). All model elements are described in detail in Appendix D1. To obtain the parameters for the NBR, maximum likelihood estimation (MLE) is used (Blattberg et al., 2008; Leeflang et al., 2015). This principle seeks estimates such that the model matches the data as well as possible. After the beta coefficients are estimated, they can be interpreted additively on the log scale. This means that the log of the expected number of purchases, ln(λ), changes by β1 if the predictor increases by one unit; equivalently, the expected number of purchases λ is multiplied by exp(β1). For categorical variables, the effect is relative to the specified baseline category.
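The interpretation of a coefficient on the log scale can be illustrated with a short calculation. The sketch below is written in Python and assumes a log link, so that ln(λ) is linear in the predictors; the intercept and coefficient values are hypothetical and are not estimates from this research.

```python
import math


def expected_count(intercept, coefficients, values):
    """Expected count under a log link: exp(intercept + sum of beta * x)."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, values))
    return math.exp(linear)


def rate_ratio(beta, delta=1.0):
    """Factor by which the expected count changes when x rises by `delta`."""
    return math.exp(beta * delta)


# Hypothetical values for one predictor.
intercept, beta = -2.0, 0.10
before = expected_count(intercept, [beta], [1.0])
after = expected_count(intercept, [beta], [2.0])
ratio = rate_ratio(beta)
```

Here `after / before` equals `ratio`: a one-unit increase in the predictor multiplies the expected count by exp(0.10), an increase of roughly 10.5%.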

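TABLE 7: Parametric models for negative binomial regression

Both panels:

$\lambda_{it} = \alpha_1 + \beta_1 \text{YouTube}_{it}^{1,2} + \beta_2 \text{TV}_{it}^{1,2} + \beta_3 \text{OnlineDisplay}_{it}^{1,2} + \beta_4 \text{OnlineSynergy}_{it} + \beta_5 \text{OnlineOfflineSynergy}_{it} + \beta_6 \text{Price}_{it}^{2} + \beta_7 \text{Promotion}_{it}^{1} + \beta_8 \text{CompetitorPurchases}_{it} + \beta_9 \text{Age}_{i} + \beta_{10} \text{Income}_{i} + \beta_{11} \text{Education}_{i} + \beta_{12} \text{SizeHousehold}_{i} + \varepsilon_1$

Online panel:

$\lambda_{it} = \alpha_2 + \beta_1 \text{YouTube}_{it}^{1,2} + \beta_2 \text{OnlineDisplay}_{it}^{1,2} + \beta_3 \text{OnlineSynergy}_{it} + \beta_4 \text{Price}_{it}^{2} + \beta_5 \text{Promotion}_{it}^{1} + \beta_6 \text{CompetitorPurchases}_{it} + \beta_7 \text{Age}_{i} + \beta_8 \text{Income}_{i} + \beta_9 \text{Education}_{i} + \beta_{10} \text{SizeHousehold}_{i} + \varepsilon_2$

Offline panel:

$\lambda_{it} = \alpha_3 + \beta_1 \text{TV}_{it}^{1,2} + \beta_2 \text{Price}_{it}^{2} + \beta_3 \text{Promotion}_{it}^{1} + \beta_4 \text{CompetitorPurchases}_{it} + \beta_5 \text{Age}_{i} + \beta_6 \text{Income}_{i} + \beta_7 \text{Education}_{i} + \beta_8 \text{SizeHousehold}_{i} + \varepsilon_3$

The superscripts 1 and 2 are footnote markers, not exponents:

1 an interaction term with price is included in the model for this variable

2 an interaction term with promotion is included in the model for this variable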

4.4 Machine learning techniques

According to Kübler et al. (2016) generalized linear models can be seen as a machine learning technique. However, the restrictions that go hand in hand with classical regression models are a disadvantage in many applications. By contrast, machine learning techniques are more flexible and very useful for discovering patterns and understanding marketing problems. Still, they are not often used in the marketing literature (Kübler et al., 2016).

| Subset | Mean | Variance | Ancillary Parameter < 0 | Lagrange Multiplier Test |
|---|---|---|---|---|
| Both panels | 0,0872 | 0,100 | 1,000 | Overdispersion |
| Online panel | 0,0921 | 0,109 | 1,000 | Overdispersion |

This research includes the following typical machine learning techniques: Artificial Neural Networks (ANN) and Support Vector Machines (SVM). Because multicollinearity is a major challenge in this research, these techniques are very useful: since they model a non-linear relationship between the inputs (independent variables) and the output (dependent variable), they perform very well in situations where multicollinearity occurs (Kotsiantis, 2007).

Artificial neural networks (ANN) are modelled on the human brain, in which neurons are connected to each other. Inputs and outputs are used to establish meaningful relationships during a training and a testing phase (Linder et al., 2004). During the training process weights are assigned to the inputs, and the algorithm adjusts these weights in such a way that the output is as close as possible to the desired output (Kotsiantis, 2007). ANNs consist of three layers. First, the input layer with the input variables. Second, a hidden layer, where non-linear relationships can be captured. Last, the output layer, where the inputs are transformed into a prediction, in this case the number of purchases (Kaefer et al., 2005). It is important to mention that hidden layers are unobservable (Kübler et al., 2016). The greatest strength of ANNs is that they are able to handle interactions and non-linear relationships (Blattberg et al., 2008).
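The three-layer structure can be made concrete with a forward pass through a network with one hidden layer. The sketch below is written in Python and only computes a prediction for given weights; it does not train the network. The logistic activation in the hidden layer, the linear output unit and all weight values are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic activation used in the hidden layer (an assumption)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward_pass(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Compute the prediction of a network with one hidden layer.

    hidden_weights holds one weight list per hidden unit; the output unit
    is linear, which suits a numeric target such as a purchase count.
    """
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Hypothetical inputs (e.g. ad contacts and price) and hypothetical weights.
prediction = forward_pass(
    inputs=[1.0, 2.0],
    hidden_weights=[[0.0, 0.0], [0.0, 0.0]],
    hidden_biases=[0.0, 0.0],
    output_weights=[1.0, 1.0],
    output_bias=0.5,
)
```

With all hidden weights set to zero each hidden unit outputs 0.5, so the prediction is 0.5 + 0.5 + 0.5 = 1.5. Training would adjust the weights to bring such predictions closer to the observed outcomes.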

The main principle behind SVM can be described as the way that: ‘input vectors are non-linearly mapped to a very high-dimension feature space’ (Cortes & Vapnik, 1995). The feature space can be described as the surface where linear decisions are created. In SVM an optimal hyperplane is constructed to distinguish the observations into different classes, in this case the number of purchases. Essential for SVM is that it is able to establish non-linear functions with a linear approach (Kübler et al., 2016). SVM accommodates non-linearity by using kernel functions (Cui & Curry, 2005; Dreiseitl & Ohno-Machado, 2002). Cortes & Vapnik (1995) illustrate that this makes this machine learning algorithm very generalizable. According to Blattberg et al. (2008) this technique is performing very well in classification and prediction tasks, but is very complex compared to other machine learning techniques. The SVM method is performed as provided by Kübler et al. (2016), where all required details of SVM are explained.

It is argued that SVM and ANN perform much better than linear regression methods (Caruana & Niculescu-Mizil, 2006). This is especially the case when the data is large, multi-dimensional and multicollinearity exists (Kotsiantis, 2007). However, ensemble methods seem to perform even better than SVM and ANN according to Caruana & Niculescu-Mizil (2006). These methods are discussed in the next paragraph.

4.5 Ensemble learning

than just a single classifier (Skurichina & Duin, 2002). In many cases, ensemble methods are performed in combination with decision trees.

Decision trees are widely applied in the marketing literature, since this method has many advantages. Due to the graphical attractiveness of the method, the outcome is very easy to interpret (James et al., 2013). A widely used decision tree method is CART (Classification and Regression Trees). For this research regression trees are most applicable, since the outcome is continuous. A decision tree uses the complete training dataset to split customers into mutually exclusive subgroups that are as pure as possible (Blattberg et al., 2008; Kübler et al., 2016). A decision tree starts at the root node and splits the data into child nodes. The final nodes are often called terminal nodes. The growth of the tree is governed by a splitting rule. In this particular case a well-known splitting rule for the CART method is used: the Gini index. The quality of a split is indicated by the decrease in impurity (Blattberg et al., 2008).
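The decrease in impurity for a candidate split can be computed as sketched below. The example is written in Python and uses the Gini index on discrete labels, which is how the Gini index is defined; for a purely numeric outcome, regression trees commonly use the reduction in variance instead. The labels and the split are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


# Hypothetical node: 0 = no purchase in the week, 1 = at least one purchase.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
mixed_split = impurity_decrease(parent, [0, 0, 0, 1], [0, 1, 1, 1])
pure_split = impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1])
```

The split that separates the two classes perfectly removes all impurity (a decrease of 0.5), whereas the mixed split only achieves a decrease of 0.125, so the tree would prefer the former.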

Despite the many advantages of decision trees, their performance is in many cases lower than that of more advanced methods such as SVM and ANN (Blattberg et al., 2008; Caruana & Niculescu-Mizil, 2006; Elith et al., 2008). Furthermore, growing trees that are too big and take many predictors into account may lead to models that are overfitted and perform poorly when predicting new incoming data (Kübler et al., 2016).

Overfitting is a problem that applies not only to decision trees but to all machine learning algorithms. This phenomenon can be defined as the risk that the model will follow the errors too closely and therefore will perform poorly on new data points (Dietterich, 1995; Hastie et al., 2009; James et al., 2013; Kübler et al., 2016). One way to overcome overfitting is by pruning the decision trees (Breiman et al., 1984). Pruning simplifies the tree and makes it more generalizable. Another way to overcome overfitting is by aggregating multiple trees with ensemble methods (James et al., 2013). This research incorporates the three most common ensemble methods: bagging, random forests and boosting.

First, bagging is applied. Bagging is a method that can improve the stability of the predictions, especially in combination with decision trees (James et al., 2013). Bagging consists of bootstrapping with replacement and aggregation (Skurichina & Duin, 2002). New bootstrapped datasets are used as new learning sets, and a decision tree is fitted on each of them. Aggregating is the process in which multiple predictions are averaged (Breiman, 1996). In this case, multiple trees are grown on multiple bootstrapped training sets, which should lead to a significant increase in accuracy (Breiman, 1996).
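The two steps of bagging, bootstrapping and aggregating, can be sketched as follows. This Python illustration is not the R implementation used in this research: the base learner is a one-split regression tree (a stump) on a single predictor, and the data, the number of trees and the seed are hypothetical.

```python
import random
from statistics import mean


def fit_stump(points):
    """Fit a one-split regression tree on (x, y) pairs by minimising squared error."""
    xs = sorted({x for x, _ in points})
    best = None
    for low, high in zip(xs, xs[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in points if x <= threshold]
        right = [y for x, y in points if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # only one distinct x value: predict the overall mean
        overall = mean(y for _, y in points)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def bagging(points, n_trees=25, seed=1):
    """Fit one stump per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        # Bootstrap: draw as many points as the original data, with replacement.
        sample = [rng.choice(points) for _ in points]
        stumps.append(fit_stump(sample))
    # Aggregation: the ensemble prediction is the mean of the tree predictions.
    return lambda x: mean(stump(x) for stump in stumps)


# Hypothetical data: the outcome jumps from 0 to 2 once x reaches 5.
points = [(x, 0.0 if x < 5 else 2.0) for x in range(10)]
predict = bagging(points)
```

Each stump sees a slightly different dataset, so the individual trees differ; averaging them smooths out this variation.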

Second, random forests are applied. Random forests differ fundamentally from bagging in that only a subset of the predictors is considered, which prevents the decision trees from correlating with each other (Breiman, 2001). This means that every time a split is considered, only m independent variables are used to separate the observations. As with bagging, the decision trees in a random forest are built on a number of new bootstrapped datasets.


is performed, the errors are taken into account in the next step, meaning that it adapts to previous estimation results (Elith et al., 2008; Lemmens & Croux, 2006). By mainly focusing on the residuals, the predictive performance of the model is 'boosted'. One of the key differences between bagging, random forests and boosting is that in the case of the latter, each tree is built on a modified dataset instead of on a new bootstrapped dataset (James et al., 2013).
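The idea of fitting each new tree to the residuals of the previous ones can be sketched as follows. This Python illustration assumes least-squares boosting with one-split regression trees, in the spirit of James et al. (2013); it is not necessarily the boosting variant used in this research, and the data, the number of rounds and the learning rate are hypothetical.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on one predictor by minimising squared error."""
    cuts = sorted(set(xs))
    best = None
    for low, high in zip(cuts, cuts[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # only one distinct x value: predict the overall mean
        overall = mean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Sequentially fit stumps to the residuals left by the previous stumps."""
    base = mean(ys)
    residuals = [y - base for y in ys]
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Modified dataset for the next round: shrink the residuals by the new fit.
        residuals = [r - learning_rate * stump(x) for x, r in zip(xs, residuals)]
    return lambda x: base + learning_rate * sum(stump(x) for stump in stumps)


def mse(actual, predicted):
    """Mean squared error between two equally long sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical data with a stepwise increasing outcome.
xs = list(range(10))
ys = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0]
model = boost(xs, ys)
mse_baseline = mse(ys, [mean(ys)] * len(ys))
mse_boosted = mse(ys, [model(x) for x in xs])
```

Unlike bagging, no bootstrap samples are drawn: every round works on the same observations, but with the outcome replaced by the current residuals.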

Previous research showed that boosted trees is the best performing method followed by random forests and bagged trees (Caruana & Niculescu-Mizil, 2006). According to Dietterich (2000) boosting is the most accurate ensemble method, and random forests is competitive with bagging. Whether these results also hold in this study in a FMCG setting, is explained in the next chapter.

4.6 Performance measures

Several performance measures are used to assess the predictive validity of all the modelling techniques that are performed. Since the dependent variable is numeric, the following criteria are used: mean square error (MSE), the root mean square error (RMSE) and the mean absolute error (MAE) (Blattberg et al., 2008; Leeflang et al., 2015; Witten & Frank, 2005). These measures also have already been used for comparing machine learning algorithms in a different setting (Razi & Athappilly, 2005) and are specified in Table 8.

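TABLE 8: Formulas for calculating performance measures (Blattberg et al., 2008).

| Performance measure | Formula |
|---|---|
| MSE | $\frac{1}{n}\sum_{i=1}^{n} e_i^2 = \frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ |
| RMSE | $\sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}$ |
| MAE | $\frac{1}{n}\sum_{i=1}^{n} \lvert e_i \rvert = \frac{1}{n}\sum_{i=1}^{n} \lvert Y_i - \hat{Y}_i \rvert$ |

$e_i$: prediction error; $Y_i$: actual outcome; $\hat{Y}_i$: predicted outcome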

root of the MSE the RMSE is obtained. After this transformation the value has the same dimension as the predicted values (Witten & Frank, 2005).

A different accuracy measure is the MAE, which averages the absolute differences between the predicted and the actual outcomes (Blattberg et al., 2008). Because the errors are not squared, large errors are weighted less heavily. Therefore, the MAE is more robust and handles outliers better than the MSE and RMSE.
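The three performance measures can be computed directly from the formulas in Table 8. The sketch below is written in Python rather than R, and the actual and predicted values are hypothetical.

```python
import math


def mse(actual, predicted):
    """Mean square error: average of the squared prediction errors."""
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))


def mae(actual, predicted):
    """Mean absolute error: average of the absolute prediction errors."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


# Hypothetical weekly purchase counts and model predictions.
actual = [0, 1, 0, 2]
predicted = [0.5, 0.5, 0.0, 1.0]
scores = {
    "MSE": mse(actual, predicted),
    "RMSE": rmse(actual, predicted),
    "MAE": mae(actual, predicted),
}
```

In this example the single error of 1.0 contributes two thirds of the MSE (1.0 out of 1.5 in the sum of squares) but only half of the MAE, which illustrates why the MAE reacts less strongly to large errors.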

4.7 Software package

Machine learning finds its origins in computer science (Cui & Curry, 2005). A wide variety of software packages is available nowadays. This research makes use of R and RStudio to perform all the machine learning techniques. This software provides great advantages in terms of estimation techniques and graphical representation, and a great variety of extensions enables effective predictive modelling. Helpful guidelines are used in performing this comparative analysis (Cran, 2016; James et al., 2013). A summary of all the techniques used in R can be found in Table 9. For the most part, default settings are used to get a fair comparison between the methods.

TABLE 9: R software packages used per predictive modelling technique

| Modelling technique | R packages | Specification of function(s) |
|---|---|---|
| Linear Regression | pscl | function: glm |
| Negative Binomial Regression | pscl, MASS | function: glm.nb |
| Neural Networks | nnet, NeuralNetTools | function: nnet, entropy: least squares |
| Support Vector Machines | e1071 | function: svm, type: svm for c-classification |
| Decision and Pruned Trees | rpart, rattle, party, partykit, caret | function: recursive partitioning (rpart), stopping criterion: gini, function: prune |
| Bagging | adabag, party, ipred | function: bagging |
| Random Forests | randomForest, pROC | function: randomForest |

5 Results

5.1 Negative binomial regression

For the negative binomial regression, three parametric models are estimated with three different data samples. As mentioned, a big advantage of a predetermined functional form is that it results in parameters that quantify the effects. This differs from other machine learning techniques, where the estimation results in different types of output, for example weights and splits, which are much harder to interpret due to their complexity. Table 10 shows the parameter coefficients for the both panels model. Small differences exist between the three models, but overall the models show great similarities (Appendix E1).

TABLE 10: NBR parameter estimates both panels

Variable | β | Variable | β
Intercept | -4.187*** | Age 25 - 34 | 0.2204
YouTube | -0.353 | Age 35 - 44 | 0.1554
TV | 0.042 | Age 45 - 54 | 0.1613
OnlineDisplay | -2.167 | Age 55 - 74 | 0.2112
OnlineSynergy | -30.078 | Age 75> | 0.1261
OnlineOfflineSynergy | -0.061 | HH_Size 2 | 0.07122
Price | 2.312*** | HH_Size 3 | -0.1295
Promotion | 2.267*** | HH_Size 4 | 0.08343
CompetitorPurchases | 0.084 | HH_Size 5> | 0.1568
Price*Promotion | -1.016*** | Income 1500 - 2300 | 0.2249*
Price*YouTube | 0.388 | Income 2300 - 3100 | -0.0242
Price*TV | -0.043 | Income 3100 - 4100 | 0.08718
Price*OnlineDisplay | 1.272 | Income 4100> | 0.0182
Promo*YouTube | 0.388 | Education Low | -0.03594
Promo*TV | -0.399 | Education Medium | -0.1451
Promo*OnlineDisplay | - | |

Reference categories: Age 12 - 24 | HH_Size 1 | Income 0 – 1500 | Education High
* p < .05 ** p < .01 *** p < .001

Besides the parameter estimates, the fit of each model is assessed by comparing it with a null model. This is done with a likelihood ratio test and with the Cox & Snell pseudo R2 (Table 11). According to the likelihood ratio test, all three models perform significantly better than a null model. The Cox & Snell pseudo R2 likewise compares each model with a null model (Leeflang et al., 2015). Although the both panels model includes more explanatory variables, the highest pseudo R2 belongs to the online panel model.

TABLE 11: Likelihood ratio test and Cox & Snell pseudo R2

Model | Loglikelihood | P-value | Cox & Snell R2
Both panels | -1909.3 | 2.2e-16*** | 0.791
Online panel | -14023 | 2.2e-16*** | 0.799
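The two fit measures reported in Table 11 can be sketched as follows. This is a minimal illustration in Python, not the code used in the thesis: the function names are hypothetical and the input values are made up for the example, not taken from the data. Both measures need only the log-likelihood of the null (intercept-only) model, the log-likelihood of the fitted model and the number of observations.

```python
import math


def likelihood_ratio_statistic(loglik_null: float, loglik_model: float) -> float:
    """LR statistic 2 * (ll_model - ll_null). Under the null hypothesis it is
    approximately chi-square distributed, with degrees of freedom equal to the
    number of additional parameters in the fitted model."""
    return 2.0 * (loglik_model - loglik_null)


def cox_snell_r2(loglik_null: float, loglik_model: float, n_obs: int) -> float:
    """Cox & Snell pseudo R2: 1 - (L_null / L_model) ** (2 / n),
    computed on the log scale for numerical stability."""
    return 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n_obs)


# Hypothetical inputs for illustration only, NOT values from the thesis.
ll_null, ll_model, n = -2500.0, -1900.0, 1000

lr = likelihood_ratio_statistic(ll_null, ll_model)
r2 = cox_snell_r2(ll_null, ll_model, n)
print(f"LR statistic: {lr:.1f}")
print(f"Cox & Snell pseudo R2: {r2:.3f}")
```

The p-value of the test follows from comparing the LR statistic with the chi-square distribution; that step is left out here because it requires a statistics library. Note that the Cox & Snell measure depends on the sample size, so values from models fitted on different panels are not directly comparable in the way an ordinary R2 on a common sample would be.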
