Generative adversarial networks in marketing:
Overcoming privacy concerns with the generation of fake data.
Gilian Ponte
S2591634
Master Thesis
Supervisor: Prof. dr. J. Wieringa
Second supervisor: dr. K. Dehmamy
Rijksuniversiteit Groningen
Faculty of Economics and Business
Department of Marketing
PO Box 800
9700 AV Groningen
Abstract
Privacy is a fundamental human right. Over the years, the right to privacy has come under pressure from the growth of the internet, new methodologies and the sheer volume of data. These privacy concerns have led individuals to react negatively to the collection and usage of individual data. The methods that put pressure on this fundamental right have now advanced to the point where they can alleviate these same privacy issues. The recent developments surrounding
generative networks allow the generation of individual fake data based on any real data
distribution. This study is the first empirical attempt to alleviate privacy concerns by means of
a generative adversarial network in the field of marketing. Consequently, for three data sets a
generative adversarial network and a Wasserstein generative adversarial network are
developed. These networks successfully generate fake data that is useful in marketing modeling
cases. Surprisingly, this study shows that estimations on fake data are able to outperform
estimations on real data. The current study shows that academics and firms are able to generate fake data to alleviate privacy concerns among individuals, promote data sharing and even advance the development of theory across academic disciplines.
Preface
I started my academic career at the Rijksuniversiteit Groningen as a pre-master student in
marketing. During that first year I was uncertain whether I would ever be capable of completing a master's degree. Nonetheless, I convinced myself that when you really want something, you are capable of achieving it. Two years later, this paper is the result of a very enjoyable period as an MSc Marketing Intelligence student.
During this academic year, I was able to feed my curiosity with courses such as: Data Science
& Marketing Analytics, Market Models, Digital Marketing Intelligence and Customer Models.
These courses provided me a thorough background in modeling. Specifically, during these
courses my interest was directed at the recent developments in artificial intelligence and deep
learning. Over the course of my masters, the development of generative networks gained
exponential attention within and outside the literature of artificial intelligence. I remember
asking Prof. dr. Wieringa during one of these courses whether the marketing literature has
investigated the applications of GANs in a privacy setting. Privacy is an issue, in and outside
marketing, that affects a substantial number of people. In my view, the development of privacy-protective methods has not gained adequate attention throughout the years. These observations led to the subject of my master thesis.
First of all, I would like to thank Prof. dr. Wieringa for the elaborate support during and outside
the process of writing my thesis, for example, the support in the application process for the Research Master and a possible subsequent PhD. I look forward to collaborating on research into the development of privacy-preserving methods in the future. Secondly, I would
like to thank my family for supporting me in many ways, raising me with a persevering and
curious attitude and for always believing in me. In discussions my family was always able to
shed a different light on my research. Finally, I would like to thank my friends with special
gratitude towards: Hidde Smit, Diede Wieldraaijer, Mats Neeft, Felix Lehmkule, Muthia
Khanza and Wisse Smit for all the support and fun I enjoyed from our friendship over the years.
Gilian Ponte
Table of contents
1. Introduction
1.1 Practical and academic contributions
1.2 Research question
1.3 Structure of the study
2. Theoretical framework
2.1 The ability of GANs to generate customer data
2.2 Privacy issues
2.3 Marketing modeling
3. Generative Adversarial Networks
3.1 Formal objective
3.2 Loss functions
3.3 Gradient descent and learning rate
3.4 Optimization algorithms
3.5 Activation functions
3.6 Training procedure
3.7 Non-convergence
3.8 Developments towards a stable GAN
4. Research design
4.1 Data description
4.2 Data normalization
4.3 GAN architecture
4.4 Wasserstein GAN architecture
5. Results
5.1 The correlation matrix of the fake data set correlates with the real data correlation matrix
5.3 The predictive accuracy of machine learning techniques is significantly lower on fake data than on real data
5.4 The addition of generated fake data to real data will significantly increase the predictive accuracy of machine learning techniques
5.5 The parameters are equal between the model based on generated fake data and an estimation on the real data
5.6 The effect of generated data on the MAPE, Theil U-statistic and RAE
6. Discussion
7. Conclusion and future research
7.1 Conclusion
7.2 Limitations
7.3 Future research
References
Appendices
Appendix 1: Generative adversarial networks code
Appendix 2: Analysis of fake and real data code
Appendix 3: Wasserstein GAN
1. Introduction
Marketing has a rich history of modeling data in efforts to understand customer behaviour and
the effectiveness of marketing channels. Wedel & Kannan (2016) present an outline of the
timeline of marketing data and analytics, from survey data in 1900 accompanied by OLS and
ANOVA to the development of methods for social and location data in the present (e.g., Nam
& Kannan, 2014; Büschken & Allenby, 2016). As new types of data became available, the
development of new methods naturally followed. Parallel to the emergence of new data types
and methods, the volume of data increased, which is often referred to as one of the defining characteristics of big data (Sagiroglu & Sinanc, 2013; Wedel & Kannan, 2016). The rise and popularization of the internet and the emergence of social media have been a considerable game changer, giving companies and marketing academics access to rich data sets containing detailed information on the individual activities of users (Bucklin & Sismeiro, 2009). Nevertheless,
the growing volume of data comes with a set of disadvantages. Challenges in privacy, data
sharing, storage, computation power and noise in data are a reality for every company or
academic when handling big data (Leeflang, Wieringa, Bijmolt & Pauwels, 2017).
Yann LeCun (2018), who is considered one of the founding fathers of the field of artificial intelligence, described a new and exciting methodological development: “Adversarial training is the coolest thing since sliced bread”, or “the most exciting thing in Deep Learning”. The introduction of generative adversarial networks (GANs) by Goodfellow, et al. (2014) was received with much enthusiasm by the artificial intelligence community. While marketing has a history of employing techniques from computer science and the field of artificial intelligence, generative adversarial networks have not yet been studied in a marketing context. This is quite striking, considering the potential solutions they might offer to the challenges present in the field of marketing, especially since a GAN has the potential to solve or alleviate the privacy issues present in the field.
model-based approaches for privacy protection in marketing. The current study is an attempt to
empirically explore the ability of generative adversarial network to generate customer data
while preserving customer privacy. This study shows that the generative adversarial networks
are able to generate fake data that can be used to alleviate privacy issues and enhance data
sharing among academics and practitioners.
1.1 Practical and academic contributions
In the field of marketing, firms annually spend around 36 billion dollars to capture and leverage
customer data (Columbus, 2014). The vast amount of investment in leveraging customer data
combined with the growth of data and possibilities to capture individual customer data led to
privacy concerns (van Doorn & Hoekstra, 2013). Consequently, the European Union and the
United States created legislation to regulate and protect the individual data of customers
(European Parliament, 2013; PCAST, 2014). A practical example of these developments is the recent series of events concerning Facebook and Cambridge Analytica. Cambridge Analytica was seeking a way to enrich customer data for modeling purposes. Data sharing enables companies to collaborate and enrich data sets to obtain a better customer view and enhance the performance of customer models (Peltier, et al. 2013). Therefore, Cambridge Analytica and Facebook shared the data of more than 87 million users (Reuters, 2019). The risks of such endeavours are clear: companies are exposed to potential privacy costs, losses in brand value, legal fines and losses of customer trust (Schneider, Jagpal, Gupta, Li & Yu, 2017). For Facebook, the data breach led to a decrease of more than 119 billion dollars in market value due to the exposure of the data of the 87 million Facebook users (Neate, 2018). To illustrate the
magnitude of such an event, the total damage in stock value was equal to the value of
McDonald's at the time. The scandal led to the termination of the firm Cambridge Analytica.
Nowadays, Facebook continues to suffer from the consequences of the data breach. The FTC
continues to investigate Facebook and the company may face a multibillion-dollar fine
regarding the Cambridge Analytica events (Reuters, 2019).
dimensionality. Leeflang, Wieringa, Bijmolt, & Pauwels (2017) describe machine learning
techniques that are better able to deal with these shortcomings. Therefore, when handling big data, marketing research needs to rely more on recent estimation techniques and modeling approaches from other fields, such as machine learning.
The advantages of these methods were not immediately evident among academics and
practitioners. The development of machine learning has a long history of setbacks and progress.
In 1950, the famous British mathematician Alan Turing asked whether machines would be able to think, thereby proposing the idea of a machine that is able to learn and become artificially intelligent (Turing, 1950). Only a year later, one of the first machines able to learn from data was developed, introducing a new field of research called ‘Artificial Intelligence’, often referred to as AI (McCarthy, 1974). Around the 1990s, the
discovery of backpropagation by Rumelhart, Hinton & Williams (1986), the development and
rise of the internet, increasing computational power and the vast amount of data contributed to
the increase of interest and funding into AI research. The field of AI started to show promising
results and even to surpass human performance in specific tasks. An example is the development of the Google Search Engine, which enables users to search the internet. DeepMind developed AlphaGo Zero, which defeated the world champion in the ancient Chinese game of Go (Silver, et al. 2017). Nowadays, machines are even able to defeat humans in video games (Vincent, 2018;
DeepMind, 2019). The introduction of the idea of using the architecture of the brain to create machines that are able to learn has, as we now know, had great consequences in almost every academic field.
is used for empirical studies often consists only of one field or company. As a result, empirical findings are often limited to one industry, requiring additional research to generalize the results. The fake data generated by a GAN could be shared between academics and practitioners, since it is not privacy sensitive.
However, this methodology raises potential issues. First of all, machine learning, and especially generative modeling, could be regarded as more engineering-oriented. It is a hands-on discipline in which new ideas are more often proven empirically than theoretically. In particular, the architecture of a GAN is often derived from practical experience and engineering rather than theory. Goodfellow, et al. (2014) confirm this notion by stating that a great deal of hyperparameter tuning is required to successfully generate data. Secondly, Goodfellow, et al. (2014) identify that GANs are hard to train and hard to get to converge. The successful convergence of a GAN is a subject of ongoing research. Moreover, GANs lack a single objective metric to monitor during training, which makes it challenging to identify at which state the GAN generates realistic data samples. The convergence of a GAN is therefore often assessed by subjectively inspecting the fake samples.
Finally, Wedel & Kannan (2016) raise the legitimate question of whether marketing academics should adopt machine learning at the expense of traditional methods, because machine learning techniques do not yet establish causal effects or generalizable theoretical insights. Machine learning techniques, especially neural networks, are often considered “black boxes”: their interpretability is limited, although improving it is an active stream of research (e.g., SHAP by Lundberg & Lee (2017)). More traditional methods, in contrast, are able to identify the effects of predictors on a dependent variable (e.g., logistic or linear regression).
1.2 Research question
To conclude, the goal of this study is to investigate the consequences of using fake data for marketing modeling purposes. This overall goal translates into the following sub-research questions. First, the ability of a GAN to generate data is investigated, resulting in the research question: “How does a GAN generate fake data?”. As mentioned, the creation of fake data based on the distribution of real data could mitigate privacy issues, which is summarized in the second research question: “Is a GAN able to alleviate privacy issues by the generation of fake data?”. Evidently, the generated fake data needs to perform comparably to, or outperform, the real data in marketing modeling. Therefore, the last research question is defined as: “What is the effect of fake data compared to real data on the predictive accuracy of marketing models?”
1.3 Structure of the study
The remainder of this study is organized as follows: Chapter 2 touches upon the current
literature on GANs and their implications for marketing modeling exercises. Chapter 3 provides
a detailed overview of the methodology developed in this study, with special attention for the
current state-of-the-art GAN architectures. Chapter 4 describes the processes that led up to the development of the GAN and the fake data. Chapter 5 elaborates on the results and the tested hypotheses. These results are discussed in chapter 6, and chapter 7 gives an overview of limitations and recommendations for future research.
2. Theoretical framework
2.1 The ability of GANs to generate customer data
indicate that the importance and influence of techniques from AI in the marketing literature is
unambiguous.
LeCun, Bengio & Hinton (2015) continued to develop the concept of neural networks and
introduced the field of deep learning. Deep learning encompasses neural networks with, instead
of one processing layer, multiple processing layers (i.e., a higher level of abstraction). This
development sparked a stream of research into deep neural networks that dramatically improved
the performance in the tasks of speech-recognition, visual object recognition, image
classification, topic classification, sentiment analysis, question answering, language translation and the processing of videos (Krizhevsky, Sutskever & Hinton, 2012; Bordes, Chopra & Weston, 2014; Sutskever, Vinyals & Le, 2014; LeCun, Bengio & Hinton, 2015). Recurrent neural networks, in turn, showed improvements on data with long-range dependencies, for example in the analysis of sequential data such as text and speech (Graves, Mohamed & Hinton, 2013).
The introduction of generative adversarial networks or GANs by Goodfellow, et al. (2014) has
shown promising results in the field of generating images, videos and audio data. Probably the
most recent and well-known example is the generation of celebrity faces by Karras, Aila, Laine
& Lehtinen (2018). The authors generated fake high-resolution images of celebrity faces that
are indistinguishable from real images of faces. Lotter, Kreiman & Cox (2015) show that GANs
are able to generate the next frame in a video sequence. Another application is to generate a
high-resolution image from low-resolution images (Ledig, et al. 2016). Isola, et al. (2016) show
that GANs are very creative. The authors demonstrate the ability of a GAN to fill in the colour
of sketches of images so that it corresponds to the ground truth. For example, a sketch of a handbag is filled in with a brown colour that resembles the original image.
The authors suggest that in the future, GANs could be applied to other kinds of e-commerce
tasks such as targeting, product recommendation or the simulation of future events. In a case
for medical trial data, Beaulieu-Jones, et al. (2018) describe the successful creation of
participant data using a GAN to facilitate data sharing.
Specifically, Goodfellow (2016) describes that a GAN approximates a density surface in a high-dimensional space. Intuitively, the dimensionality relates to the number of variables in a data set, where all the distributions of the variables and the relationships between variables are represented by a surface. Empirically, it is interesting to investigate whether the GAN is capable of approximating a real surface in a high-dimensional space with high accuracy. For simplicity, this study refers to the variables in a one-dimensional space. The main motivation for this choice is the curse of dimensionality described by Goodfellow, Bengio & Courville (2016): as the number of variables increases, the dimensionality of the density surface increases, which leads to several statistical challenges. For a detailed description of these challenges, this study refers to Goodfellow, Bengio & Courville (2016). Therefore, the current study empirically investigates whether the one-dimensional distributions and the correlations within the real data set are contained in the fake data set. These two conditions are required and often
serve as a proxy to measure and compare two surfaces in a high-dimensional space. Both Beaulieu-Jones, et al. (2018) and Kumar, Biswas & Sanyal (2018) confirm that the correlations among the generated data variables significantly correlate with the correlations among the real data set. To conclude, this leads to the first hypothesis:
H1: The correlation matrix from the fake data set significantly correlates with the real data
correlation matrix.
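H1 can be operationalized in a few lines. The sketch below (my own illustrative operationalization, not the thesis's exact procedure) computes the correlation matrices of a real and a fake data set and correlates their off-diagonal entries; the noisy copy standing in for GAN output and all numbers are assumptions for demonstration only.

```python
import numpy as np

def correlation_similarity(real, fake):
    """Correlate the off-diagonal entries of the two correlation matrices,
    a simple proxy for H1 (cf. Beaulieu-Jones, et al. 2018).

    real, fake: arrays of shape (n_observations, n_variables)."""
    r_real = np.corrcoef(real, rowvar=False)
    r_fake = np.corrcoef(fake, rowvar=False)
    iu = np.triu_indices_from(r_real, k=1)  # upper triangle, excluding the diagonal
    return np.corrcoef(r_real[iu], r_fake[iu])[0, 1]

rng = np.random.default_rng(0)
cov = [[1.0, 0.6, 0.2], [0.6, 1.0, 0.4], [0.2, 0.4, 1.0]]
real = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=1000)
fake = real + rng.normal(scale=0.3, size=real.shape)  # stand-in for GAN output
r = correlation_similarity(real, fake)
```

A value of r close to one would be consistent with H1; a formal test would additionally require the p-value of this correlation.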
As described, a GAN approximates the probability distributions of the real data set in a
high-dimensional space. Therefore, when a GAN converges, the generated data distributions resemble the real data distributions, which has been partially confirmed by other studies but has never been statistically tested (Goodfellow, et al. 2014; Beaulieu-Jones, et al. 2018; Kumar, Biswas & Sanyal, 2018). Therefore, a sub-hypothesis is developed as follows:
2.2 Privacy issues
The literature is conclusive regarding the effects of privacy in marketing. Customers respond negatively to a firm's collection and use of individual data (van Doorn & Hoekstra, 2013; Martin, Borah & Palmatier, 2017). Martin, Borah & Palmatier (2017) describe that a data breach leads
to significant negative stock performance and even spillover effects on the value of other (rival)
companies. However, transparency and control promises positively mediate this effect. This
makes privacy issues not only relevant for companies and marketing academics, but one of their
highest priorities (Kannan & Li, 2017: Marketing Science Institute, 2018).
To potentially deal with privacy issues, Wedel & Kannan (2016) identify two potential actions
to take: data minimization and data anonymization. Data minimization refers to limiting the amount of data marketers collect and disposing of the data when it is no longer needed. This counteracts the goal of generating generalizable results, for which rich, high-volume data is required. Data anonymization is accomplished by k-anonymization, removing personally identifiable information, recoding, swapping or randomizing data, or hashing algorithms (Reiter, 2010). Nevertheless, in case of a data breach, such data is still considered to be privacy sensitive (Miller & Tucker, 2011).
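To make the k-anonymization concept concrete, the sketch below checks the defining property of k-anonymity: every combination of quasi-identifier values must occur at least k times. The field names (age_band, zip3, churn) and the records are hypothetical examples of my own, not data from this study.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    occurs at least k times -- the defining property of k-anonymity."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in combos.values())

# Hypothetical generalized customer records.
customers = [
    {"age_band": "20-30", "zip3": "970", "churn": 1},
    {"age_band": "20-30", "zip3": "970", "churn": 0},
    {"age_band": "30-40", "zip3": "970", "churn": 0},
]
ok = is_k_anonymous(customers, ["age_band", "zip3"], k=2)  # one group has size 1
```

In this toy example the single 30-40 record makes the table fail 2-anonymity, illustrating why anonymized data can remain re-identifiable.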
Wedel & Kannan (2016) propose that marketing analytics should develop procedures to balance minimization and anonymization against the resulting degradation of diagnostic and predictive power. A GAN could be considered the perfect example of a data anonymization technique that does not minimize the data. Instead of minimizing, it generates data based on the real data distribution, which leads, in theory, to very useful data for modeling practices (Goodfellow, et al. 2014; Wieringa, et al. 2019). Moreover, since the data is created based on a real data distribution, it is not possible to trace individual real customers back from the fake data in case of a data breach.
an individual level while ensuring individual privacy protection. Instead of focussing on the
methodology to mitigate privacy issues in the study of Holtrop, et al. (2017), this study focusses
on the generation of fake data with a GAN to preserve the privacy of consumers.
Schneider, Jagpal, Gupta, Li & Yu (2018) describe that the generation of fake data provides an
important advantage over other privacy protection measures. Namely, the generation allows
theoretical guarantees of privacy (e.g., differential privacy). Beaulieu-Jones, et al. (2018)
employ differential privacy in the architecture of their GAN, which allows the authors to control
for the privacy of the participants by adding a small amount of random noise to the weights of
a GAN. A more technical description of differential privacy is available in Abadi, et al. (2016).
Naturally, the ability of GANs to mitigate privacy issues is conditional on the performance of
the fake data or combinations of fake and real data in marketing modeling practices.
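The core per-step mechanism behind the differentially private training of Abadi, et al. (2016) and Beaulieu-Jones, et al. (2018) can be sketched as follows: clip each gradient to a maximum norm and add calibrated Gaussian noise. This is an illustrative sketch only; the function name and parameter values are my own assumptions, not the authors' implementation.

```python
import numpy as np

def privatize_gradient(grad, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Clip a gradient to a maximum L2 norm, then add Gaussian noise.

    This mimics the per-example step of differentially private SGD
    (after Abadi, et al. 2016); parameter values are illustrative."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip_norm / (norm + 1e-12))  # scale down if too large
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=grad.shape)
    return clipped + noise

g = np.array([3.0, 4.0])           # L2 norm 5, will be clipped to norm 1
g_private = privatize_gradient(g)
```

The noise scale, together with the clipping bound and the number of training steps, determines the formal privacy budget; in practice this bookkeeping is done by a so-called moments accountant.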
2.3 Marketing modeling
Several empirical studies in the subfields of marketing literature report the limitation of having
limited data. The literature ranges from churn prediction (Neslin, et al. 2006; Holtrop, et al. 2017) and attribution modeling (Anderl, Becker, von Wangenheim & Schumann, 2016) to modeling markets (van Heerde, Gijsenberg, Dekimpe & Steenkamp, 2013). In particular, the performance, staying power and architecture of churn models have been heavily studied in the marketing literature (Ha, Cho & Maclachlan, 2005; Lemmens & Croux, 2006; Neslin, et al. 2006; Risselada, Verhoef & Bijmolt, 2010; Ascarza & Hardie, 2013; Holtrop, et al. 2017).
Risselada, Verhoef & Bijmolt (2010) describe that the predictive accuracy of churn models heavily depends on the volume and nature of the data set (cf. the no-free-lunch theorem). The authors describe that a logistic regression outperforms other methods when the volume of data is fairly limited. However, for larger data sets (n ≈ 1000), neural networks, tree methods and ensemble methods outperform other methods (Perlich, Provost & Simonoff, 2004). This implies that adding generated fake data to the real data most likely has the highest impact on the predictive accuracy of neural networks, tree methods and ensemble methods.
while preserving the individual privacy of medical patients. However, this difference has not been tested for significance, nor in a marketing context, and only a limited number of machine learning methods were employed. This leads to the second hypothesis:
H2: The predictive accuracy of machine learning techniques is significantly lower on fake
data than on real data.
Complementary to Beaulieu-Jones, et al. (2018), this study investigates whether the addition of
the generated fake data, to the real data, increases the predictive accuracy of the machine
learning techniques in a churn context. From an artificial intelligence perspective, Goodfellow, Shlens & Szegedy (2014) use fake data points, also called “adversarial examples”, to increase the performance of predictive models. The goal is to make the algorithm robust by letting it learn from adversarial examples that are extremely similar to the real samples, so that it becomes able to separate them with high confidence.
From a marketing perspective, Risselada, Verhoef & Bijmolt (2010) describe how the
performance of machine learning techniques heavily depends on the nature and amount of data
in a case for predicting churn. Also, as the number of observations in a data set increases, the
neural networks, tree and ensemble methods start outperforming the other methods (Perlich,
Provost & Simonoff, 2004). Leeflang, Wieringa, Bijmolt & Pauwels (2015) elaborate on this
idea by describing how in general with more variability in data samples, the predictor variables
are estimated with more precision. This relates to the property of a GAN to be able to generate
data with high stochasticity and variability within, or sometimes even outside (see Figure 7 or
9), the range of the real data distribution (Goodfellow, et al. 2014). Therefore, when adding
generated samples from a GAN to the real data, the predictor variables are estimated with more
precision and the predictive accuracy is expected to increase.
This leads to the following sub-hypothesis:
H2a: The addition of generated fake data to real data significantly increases the predictive
accuracy of machine learning techniques compared to only real data.
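The comparison behind H2a can be sketched on toy data: train the same classifier once on a small "real" set and once on the real set augmented with generated data, and compare test accuracy. Everything below (the sampling function, the deliberately simple nearest-mean classifier, and the sample sizes) is an illustrative stand-in of my own, not the thesis's churn data or models.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Toy churn-like data: class 1 is shifted by +1 on both predictors."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + y[:, None]
    return X, y

def nearest_mean_accuracy(X_train, y_train, X_test, y_test):
    """Classify each test point by the nearer class mean."""
    m0 = X_train[y_train == 0].mean(axis=0)
    m1 = X_train[y_train == 1].mean(axis=0)
    pred = (np.linalg.norm(X_test - m1, axis=1)
            < np.linalg.norm(X_test - m0, axis=1)).astype(int)
    return (pred == y_test).mean()

X_real, y_real = sample(50)        # small "real" training set
X_fake, y_fake = sample(5000)      # stand-in for GAN-generated data
X_test, y_test = sample(5000)

acc_real = nearest_mean_accuracy(X_real, y_real, X_test, y_test)
acc_aug = nearest_mean_accuracy(np.vstack([X_real, X_fake]),
                                np.concatenate([y_real, y_fake]),
                                X_test, y_test)
```

Here the fake data comes from the same distribution as the real data by construction; the open empirical question in H2a is precisely whether GAN output is close enough to the real distribution for such augmentation to help.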
data is that GANs are only able to generate data on variables that are readily available. Every introductory course on statistics tells us that having more data reduces uncertainty (i.e., the central limit theorem). Therefore, the variance, or standard deviation, of the parameter estimates decreases, which results in high t-values (e.g., in OLS). In theory, when the amount of data is increased indefinitely, every parameter will have a significant effect on the dependent variable.
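The shrinking standard errors can be illustrated with a small OLS simulation: the standard error of a slope scales roughly with 1/√n, so its t-value grows with the sample size even when the true effect is small. The setup below (a no-intercept regression with a true slope of 0.1) is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def ols_t_value(n, beta=0.1, sigma=1.0):
    """Simulate y = beta*x + noise and return the t-value of the OLS slope."""
    x = rng.normal(size=n)
    y = beta * x + rng.normal(scale=sigma, size=n)
    b = np.sum(x * y) / np.sum(x * x)                 # OLS slope (no intercept)
    resid = y - b * x
    se = np.sqrt(np.sum(resid**2) / (n - 1) / np.sum(x * x))
    return b / se

t_small = ols_t_value(100)       # t roughly around 1: not significant
t_large = ols_t_value(100_000)   # t roughly around 30: highly significant
```

The same weak effect thus becomes "significant" merely by adding observations, which is why generating arbitrary amounts of fake data complicates classical significance testing.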
As described, the generated samples consist of high variability and stochasticity in terms of the
values of the variables (Goodfellow, et al. 2014). Empirically, Beaulieu-Jones, et al. (2018) confirmed that the generated fake data exhibits a higher variability than the real data, due to the addition of noise to the weights. Therefore, when increasing the amount of data used for estimation, the variance in the estimation on fake data is expected to differ from the variance in the estimation on real data. This leads to the development of the third
hypothesis:
H3: The variance of parameters is different in the estimation on generated fake data,
compared to an estimation on the real data.
The question remains whether the parameters, defined as β, in the models change when using only the fake data. At the time of writing, there are no studies that compare the effects in a market model estimated on generated fake data from a GAN with those of a model estimated on real data. In case of successful convergence of the GAN, the fake data resembles the real data distribution in a high-dimensional space. Therefore, the relationships between the variables are preserved and the effects of the parameters in both models are expected to be similar (Goodfellow, et al. 2014). This leads to the development of the following sub-hypothesis:
H3a: The parameters are equal between the estimation based on generated fake data and an
estimation on the real data.
in the fake estimation is equal to that in the real estimation. However, according to Beaulieu-Jones, et al. (2018), the variance is higher in the fake estimation due to the noise added to the weights of a GAN to account for the privacy of the individual. Assuming that the parameters are the same while the variance differs, the following sub-hypothesis is developed:
H3b: The t-values of parameters differ between the estimation on generated fake data,
compared to an estimation based on real data.
The fact that the generated fake data affects the level of uncertainty and parameters of an OLS
estimation, makes it worthwhile to investigate the effects on the predictive validity between the
model estimated on generated fake data and real data. The predictive validity measures of a
model provide information on the predictive power of the estimation. When these measures are
comparable between the two estimations, this would promote and enable data sharing and
mitigate privacy concerns in marketing. A predictive validity measure that is dimensionless across models is the Mean Absolute Percentage Error (MAPE). The MAPE thus allows for a comparison of the predictive validity of two estimations regardless of the scale of the dependent variable. Therefore, this study uses the MAPE to compare predictive performance. There is no
literature available to develop a hypothesis, thus the following empirical issue is posed:
How does the generated fake data influence the MAPE, compared to the estimation on the
real data?
Moreover, it is interesting to examine whether the model outperforms a naïve model that takes the value of the dependent variable in period t−1 as the prediction for the next period, t. The relevant measures are the Relative Absolute Error (RAE) and Theil's U-statistic. When the outcome of these statistics is
less than one, the model outperforms a naïve model (Leeflang, Wieringa, Bijmolt & Pauwels,
2015). This scenario would imply that estimation on fake data is worthwhile. Therefore, the
following empirical issue is formulated:
How does the generated fake data influence the RAE and Theil U-statistic, compared to the
estimation on the real data?
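The three predictive validity measures can be computed as follows. The definitions below follow common usage (cf. Leeflang, Wieringa, Bijmolt & Pauwels, 2015): the RAE and Theil's U compare the model's errors against a naïve last-value forecast. The numeric example is my own illustration, not data from this study.

```python
import numpy as np

def mape(actual, predicted):
    """Mean Absolute Percentage Error (in %)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return 100 * np.mean(np.abs((actual - predicted) / actual))

def rae(actual, predicted):
    """Relative Absolute Error versus a naive forecast (y_{t-1} predicts y_t)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    naive = actual[:-1]
    return (np.sum(np.abs(actual[1:] - predicted[1:]))
            / np.sum(np.abs(actual[1:] - naive)))

def theil_u(actual, predicted):
    """Theil's U-statistic: ratio of the model's RMSE to the naive RMSE."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    naive = actual[:-1]
    return (np.sqrt(np.mean((actual[1:] - predicted[1:]) ** 2))
            / np.sqrt(np.mean((actual[1:] - naive) ** 2)))

y = np.array([100.0, 110.0, 120.0, 130.0])      # illustrative actuals
yhat = np.array([102.0, 108.0, 121.0, 128.0])   # illustrative predictions
```

For this example the RAE and Theil's U are both well below one, the threshold at which a model is said to beat the naïve benchmark.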
3. Generative Adversarial Networks
Goodfellow (2016) describes a GAN by the idea of a game between two players. The first player is a generator that creates samples intended to represent the real training data distribution. The second player is a discriminator that tries to distinguish between the fake and real data. This game is best illustrated as a competition between a counterfeiter trying to create fake money that passes as real, and the discriminator acting as the police trying to detect the counterfeit money (see Figure 1). The goal of the game is for the generator to create samples good enough to fool the discriminator. When the generator succeeds, the distribution of its samples resembles that of the real money.
3.1 Formal objective
More formally, both players are functions, represented by deep neural networks, that are differentiable in their parameters and inputs. GANs with the objective of generating images are often referred to as DCGANs, introduced by Radford, et al. (2015), which stands for deep convolutional generative adversarial networks. Convolutional neural networks allow for the
preservation of the correlation among variables in a data set. These networks are regarded as
state-of-the-art for GAN implementations since the publication of Radford, et al. (2015). For a
technical review of convolutional neural networks see Schmidhuber (2015) or Goodfellow,
Bengio & Courville (2016). For an extensive review on artificial neural networks see Leeflang,
Wieringa, Bijmolt & Pauwels (2017).
Goodfellow, et al. (2014) define the discriminator as a function D that takes x as input, where x is drawn from the real or the fake data distribution, and outputs the probability, in the range (0, 1), that the sample is real. The generator is defined as a function G that takes z as input, where z is drawn from a Gaussian distribution of the same dimensionality as the real data distribution, and outputs a fake sample of data points in the range (-1, 1) in the case of a hyperbolic tangent activation function. These functions, or players, compete in a minimax game; see equation (1).
min_G max_D V(D, G) = 𝔼_{x∼p_data(x)}[log D(x)] + 𝔼_{z∼p(z)}[log(1 − D(G(z)))]   (1)
𝔼_{x∼p_data(x)} refers to an expectation over samples from the probability distribution of the real data set and 𝔼_{z∼p(z)} refers to an expectation over samples from the noise distribution fed to the generator; D(x) is the probability of samples evaluated by D to be real. Specifically, the objective is to maximize the discriminator D over the real training data x, while minimizing over the generator G (i.e., D(x) ≈ 1 and 1 − D(G(z)) ≈ 0). This procedure results in the generator creating a data distribution that approximates the real data distribution x. Therefore, the generator generates realistic fake data samples. A simplistic representation of the architecture of a GAN, in which the generator (G) transforms noise into fake samples and the discriminator (D) labels samples from the data set as real or fake, is visible in Figure 1.
3.2 Loss functions
A loss function is a function that the neural network attempts to minimize. The correct loss function depends heavily on the dependent variable. For example, in the case of a continuous dependent variable, the network could minimize the mean squared error or root mean squared error (MSE & RMSE). In the case of a classification problem, the loss function is the cross-entropy. Minimizing the cross-entropy is equal to maximizing the log-likelihood, or minimizing the negative log-likelihood. To illustrate this concept, the cross-entropy is defined as the difference between an empirical probability distribution and the probability distribution defined by a model, denoted as p_model(x). For example, in case of classification, adjusting the weights (θ) of a discriminator by means of a maximum likelihood procedure creates a latent probability distribution which approximates the empirical data distribution (for proof consider Murphy, 2012: Goodfellow, Bengio & Courville, 2016).
This distance between the empirical data distribution p̂_data(x) and p_model(x) is defined as the negative log-likelihood, as in equation (2). Here, taking the log prevents the loss function from saturating when the likelihood of a function is very close to zero. To illustrate why taking the log is important, consider a scenario where the likelihood approaches zero due to taking the product of the probabilities of multiple examples in the maximum likelihood procedure. Here, the gradient of the loss function is likely to be very small, which results in very small updates and slows down training or convergence of the GAN (Goodfellow, 2016). Therefore, taking the log yields a function that increases everywhere, which results in a more stable function and allows faster convergence.
−𝔼_{x∼p̂_data}[log p_model(x)]   (2)
Therefore, Goodfellow, Bengio & Courville (2016) describe that one way of interpreting maximum likelihood estimation is to view it as minimizing the difference between the empirical distribution and the model distribution, also referred to as the Kullback-Leibler divergence. This cross-entropy can, for example, be defined between two Bernoulli, softmax or Gaussian distributions.
3.2.1 Discriminator
Consider the discriminator in Figure 1: the neural network tries to determine whether a sample is fake or real by means of a sigmoidal activation function (see section 3.5.1). The network attempts to minimize the cross-entropy between the real data distribution and the latent distribution created by the discriminator, by adjusting the weights through mini-batch stochastic gradient descent (section 3.3 introduces this method). The minimization of the cross-entropy leads to matching the model's latent distribution to the real data distribution, see Goodfellow, Bengio & Courville (2016) for more details. The formal cost that the discriminator minimizes, its cross-entropy, is displayed in equation (3) (Goodfellow, 2016).
J^(D)(θ^(D), θ^(G)) = −(1/m) 𝔼_{x∼p_data(x)}[log D(x)] − (1/m) 𝔼_{z∼p(z)}[log(1 − D(G(z)))]   (3)
3.2.2 Generator
From equations (1) and (3) we derive that the generator G attempts to minimize log(1 − D(G(z))). In other words, it strives to make the discriminator D believe that the generated samples G(z) are real. Goodfellow, et al. (2014, 2016) argue that training such a network in practice is not ideal. In the initial phase of training, the discriminator minimizes the cross-entropy on a combination of real and fake samples. As a result, the generator receives almost no gradient to minimize the same cross-entropy (i.e., vanishing gradients). Therefore, Goodfellow, et al. (2014, 2016) propose that the generator G should maximize log(D(G(z))) instead of minimizing log(1 − D(G(z))). This gives the generator a more stable gradient (see Goodfellow, et al. 2016 for details). To illustrate this problem: in the initial phase of training, G generates very poor samples, since it just samples from random noise, thus the discriminator is very accurate in separating the real samples from the fake samples. In this situation, D predicts all the classes correctly, D(G(z)) is close to zero, and log(1 − D(G(z))) is close to zero as well, so G has almost no gradient to learn from and is unable to generate realistic samples. By maximizing log(D(G(z))), the gradient of the loss function is less likely to saturate in a situation where D is highly confident. This leads to the formal cross-entropy cost or negative log-likelihood of the generator, where again we apply the log function to prevent the gradient from saturating:
J^(G) = −(1/m) 𝔼_z[log D(G(z))]   (4)
Here, G has the objective to minimize this cost, i.e., to maximize log D(G(z)), over mini-batches of size m. Both the loss function of G and that of D are heavily adapted throughout the literature on GANs (e.g., Martinez & Kamalu, 2018: Kumar, Biswas & Sanyal, 2018).
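As a concrete sketch, equations (3) and (4) translate almost directly into code. The snippet below is a minimal illustration in which the discriminator outputs for one mini-batch are hypothetical values, not the output of a trained network; the mini-batch average plays the role of the expectations.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Cross-entropy cost of the discriminator, equation (3):
    -mean(log D(x)) - mean(log(1 - D(G(z))))."""
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator cost, equation (4): -mean(log D(G(z)))."""
    return -np.mean(np.log(d_fake))

# Hypothetical discriminator outputs for a mini-batch of m = 4 samples.
d_real = np.array([0.9, 0.8, 0.95, 0.85])  # D(x): real samples scored close to 1
d_fake = np.array([0.1, 0.2, 0.05, 0.15])  # D(G(z)): fake samples scored close to 0

print(discriminator_loss(d_real, d_fake))  # small: D separates real from fake well
print(generator_loss(d_fake))              # large: G is still fooling nobody
```

Note how a confident discriminator produces a small cost for itself but a large cost for the generator, which is exactly the gradient signal G trains on.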
3.3 Gradient descent and learning rate
Neural networks learn their weights through back-propagation (Rumelhart, Hinton & Williams, 1986: Bishop, 2006: Goodfellow, Bengio & Courville, 2016). Back-propagation aims to derive weights in each layer of the network that ensure that, for a specific input vector, the output produced by the function is the same as or close to the desired output. The difference between the
actual output vector and desired output vector is minimized by taking the partial derivative
(gradient) of this error with respect to each weight in the network (Rumelhart, Hinton &
Williams, 1986). The concept of a gradient of a loss function is depicted in Figure 2. The
weights of the network are updated taking a step in the opposite direction of the sign of the
gradient. When the gradient is positive the weights are tuned more negatively and vice versa to
reach the global minimum, see equation (5).
In this example, the gradient is negative, thus the weights are updated in a positive direction.
This step is often referred to as the learning rate α (Ruder, 2016: Goodfellow, Bengio &
Courville, 2016).
x′ = x − α ∇_x f(x)   (5)
Here, x represents the weight, α represents the learning rate, ∇_x f(x) the gradient of the loss function and x′ the updated weight. Notice that the direction of the step is opposite to the sign of the gradient, as the minus sign flips it. In Figure 2
the gradient is negative, which results in an increase of the weights. Intuitively, it is important
to pick a reasonable value for this step. If the learning rate is too small, finding the global or
local minimum takes ample training iterations. If the learning rate is too large, the optimizer
never finds a local or global minimum of the loss function. In practice for deep neural networks,
it is not given to arrive at a global minimum and often a local minimum or even a saddle point
is found. Choromanska, Henaff, Mathieu, Arous, LeCun (2014) show that when increasing the
18
number of hidden layers, getting stuck in a local minima is less of an issue as the performance
of the network does not differ much from when a global minimum is found. Intuitively, one
could imagine that when restricting the network to find the global minimum, the network is
very likely to overfit.
Ruder (2016) describes that stochastic mini-batch gradient descent (SGD) based training
algorithms are applied to find the global minimum of a loss function. Here, stochastic refers to
the fact that the data used to minimize the loss function is drawn randomly. Mini-batch signifies that each update uses a random sample of, typically, 50 to 256 observations from the original data set. Goodfellow, Bengio & Courville (2016) describe that, because the weights are updated on a stochastic mini-batch of data, the loss function differs each time a mini-batch is sampled. This iterative process drives the loss function to a
minimum. Therefore, it is not always required to arrive at a global minimum due to the iterative
nature of the stochastic gradient procedure. In addition, it would be computationally too
expensive to restrict an optimizer to only satisfy for a global minimum, especially when the
loss function is represented in a high-dimensional space.
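The update rule of equation (5) combined with mini-batch sampling can be sketched on a toy problem. The snippet below fits the single weight of a linear model with mini-batch SGD; the data-generating process, learning rate and mini-batch size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3x + noise; we learn the single weight w with mini-batch SGD.
X = rng.normal(size=1000)
y = 3.0 * X + rng.normal(scale=0.1, size=1000)

w, alpha, batch_size = 0.0, 0.05, 64  # initial weight, learning rate, mini-batch size
for step in range(200):
    idx = rng.integers(0, len(X), size=batch_size)  # stochastic: random mini-batch
    xb, yb = X[idx], y[idx]
    grad = np.mean(2 * (w * xb - yb) * xb)          # gradient of the MSE loss w.r.t. w
    w = w - alpha * grad                            # equation (5): step against the gradient

print(round(w, 2))  # close to the true weight of 3.0
```

Because every mini-batch gives a slightly different gradient, the weight fluctuates around the minimum rather than landing on it exactly, which is the behaviour described above.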
3.4 Optimization algorithms
The question that remains from the previous paragraph is: “What is a good learning rate for my
loss function?”. Naturally, a frequently occurring answer would be: “It depends”. Goodfellow,
Bengio & Courville (2016) recognize the importance and difficulty of finding the correct value
for this hyperparameter, since it has a significant effect on model performance. The authors
describe several adaptive optimization algorithms: AdaGrad, RMSprop and Adam. These
methods all have in common that the learning rate is adapted to the value of the gradient: the larger the gradient, the smaller the learning rate for the next step. Intuitively this makes sense: a large gradient indicates a steep slope, where a large step would risk overshooting the global or local minimum of a function, see Figure 2.
In addition, adaptive optimizers are less sensitive to the choice of hyperparameters in the neural network, compared to non-adaptive SGD
algorithms. Here, hyperparameters refer to the overall architecture of the neural network (e.g.,
the number of layers, the activation function, loss function, dropout, amount of regularization).
Therefore, these optimizers are preferred over the non-adaptive SGD optimizers in neural
networks or GANs (Beaulieu, et al., 2018: Karras, et al. 2018: Kumar, Biswas & Sanyal, 2018).
Due to these arguments, the generator and discriminator use the Adam optimization algorithm
in the architecture to minimize the loss function (see section 4.3.1).
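As a sketch of how Adam adapts the step size, the snippet below implements a single Adam update following Kingma & Ba (2014) and applies it to a simple quadratic loss; the default hyperparameter values are those suggested by the authors, while the loss itself is an illustrative assumption.

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2014): the effective step adapts to
    running estimates of the gradient's mean (m) and uncentered variance (v)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2 (gradient 2w) starting from w = 5.
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 3001):
    w, m, v = adam_step(w, 2 * w, m, v, t, alpha=0.01)
print(round(w, 3))  # close to the minimum at 0
```

The division by √v̂ is what makes the method adaptive: consistently large gradients inflate v and shrink the effective step, without the user retuning α.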
3.5 Activation functions
An activation function transforms the weighted, summed inputs of the network into an output, for example a probability (Leeflang, Wieringa, Bijmolt & Pauwels, 2017). Therefore, the final activation function is usually referred to as the output layer of a neural network (Goodfellow, Bengio & Courville, 2016). The design of these functions is an active field of research and does not enjoy a strong theoretical foundation yet.
3.5.1 Discriminator
The objective of the discriminator is to distinguish the fake from the real samples. A popular activation function in GANs for the discriminator is the sigmoid or logistic function (Goodfellow, Bengio & Courville, 2016).

σ(x) = 1 / (1 + e^(−x))   (6)
3.5.2 Generator
Goodfellow (2016) describes that the generator has very few restrictions on its design. In practice, the output layer of the generator generally consists of the hyperbolic tangent or tanh activation function, see equation (7). The usage of this activation function is practically rather than theoretically motivated (e.g., Karras, et al. 2018: Kumar, Biswas & Sanyal, 2018). Kumar, Biswas & Sanyal (2018) and Karras, et al. (2018) employ data normalization so that the real and generated data are in a range of (−1, 1). This property makes the tanh function most appropriate for the output layer, since tanh(z) ∈ (−1, 1).

tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))   (7)
3.5.3 Rectified linear unit
The main disadvantage of the previously described activation functions is saturation. This makes gradient-based learning for the hidden layers very difficult. For this reason, activation functions that suffer from saturation are excluded from the hidden layers in the architecture of a neural network. Nair & Hinton (2010) introduced rectified linear units (ReLU) as a method for hidden layers that does not saturate for large positive input values, see equation (8).

y_i = x_i if x_i ≥ 0, and y_i = 0 if x_i < 0   (8)
Where x_i is the input value and y_i the output. The ReLU does not saturate for positive inputs, which keeps the gradient informative in the hidden layers. However, when the input is negative, the unit outputs zero and receives no gradient, so it can remain permanently inactive.
3.5.4 Leaky ReLU
Maas, Hannun & Ng (2013) proposed Leaky Rectified Linear Units to overcome this problem by expanding the range of the rectified linear unit, see Figure 3:

y_i = x_i if x_i ≥ 0, and y_i = a_i x_i if x_i < 0

Where a_i is a hyperparameter in the range of (0, +∞). This property allows for a small gradient when the unit is not active (i.e., x_i < 0). Xu, Wang, Chen & Li (2015) propose a value of 5.5 for a_i, while Radford, et al. (2015) propose a value of .2. This hyperparameter gives the neuron the opportunity to recover from the inactive status. As visible in Figure 3, the Leaky ReLU does not suffer from the inactivity problem, as it multiplies negative input values by the defined hyperparameter. Therefore, the Leaky ReLU is a more effective learning function in the hidden layers of a neural network (Xu, Wang, Chen & Li, 2015) and is regarded as the standard for the architecture of state-of-the-art GANs (e.g., Ledig, et al. 2016 and Karras, et al. 2018).
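The activation functions discussed above can be sketched as follows; the input values are arbitrary and only serve to contrast the behaviour of the functions for negative inputs.

```python
import numpy as np

def sigmoid(x):            # equation (6): discriminator output layer, range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):               # equation (7): generator output layer, range (-1, 1)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):               # equation (8): hidden layers, zero for negative inputs
    return np.where(x >= 0, x, 0.0)

def leaky_relu(x, a=0.2):  # hidden layers; a = .2 as in Radford, et al. (2015)
    return np.where(x >= 0, x, a * x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))     # saturates towards 0 and 1 for extreme inputs
print(relu(x))        # negative units are fully inactive
print(leaky_relu(x))  # negative units keep a small, non-zero output
```

The contrast between the last two lines is the point of section 3.5.4: the leaky variant leaves a gradient path through inactive units.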
3.6 Training procedure
The GAN starts training by combining a randomly generated mini-batch of size m of noise z, where z is drawn from a normal distribution (z ∼ N(0,1)), with a mini-batch of the real data x, see Algorithm 1 adapted from Goodfellow, et al. (2014). The combined batch of data is used
to train the discriminator D with the objective to identify the fake samples. To reach their
objective, the discriminator and generator attempt to minimize the cross-entropy (Goodfellow,
2016). To minimize the cross-entropy, the two networks (D and G) simultaneously apply
mini-batch stochastic gradient descent (SGD) in the function space in an attempt to find a global
minimum or local minimum (see section 3.3). Goodfellow (2016) recommends the usage of the
gradient-based optimization algorithm Adam (Kingma & Ba, 2014). This procedure allows the
discriminator to optimize weights, improve the prediction accuracy and the generator to
construct more realistic samples. If both models have sufficient capacity, the competition
between the two networks converges when Nash equilibrium is accomplished.
Nash equilibrium is defined as a state where, two players do not gain from deviating from their
strategies (Nash, 1950). For example, in a game between the discriminator and generator, Nash
equilibrium is reached when both networks do not gain much from adjusting the weights to
minimize the loss function. In this state, the generator produces highly realistic samples that
make the discriminator unable to separate the real x from the fake samples z (i.e., D(x) = .5).
In practice, the gradient step for the generator is usually performed on log D(G(z)) instead of on log(1 − D(G(z))) (see section 3.2.2).
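The alternating training procedure can be sketched on a deliberately tiny example. In the snippet below the generator is a linear function G(z) = az + b, the discriminator a logistic regression D(x) = σ(wx + c), the real data a one-dimensional Gaussian, and the gradients are derived by hand; all of these are illustrative assumptions, not the deep architectures used in this study. The generator uses the non-saturating loss of section 3.2.2.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

a, b = 1.0, 0.0       # generator parameters: G(z) = a*z + b
w, c = 0.1, 0.0       # discriminator parameters: D(x) = sigmoid(w*x + c)
alpha, m = 0.05, 128  # learning rate and mini-batch size

for step in range(2000):
    x = rng.normal(4.0, 1.0, m)   # mini-batch of real data
    z = rng.normal(0.0, 1.0, m)   # mini-batch of noise
    g = a * z + b                 # fake samples G(z)

    # Discriminator step: minimize -mean(log D(x)) - mean(log(1 - D(G(z)))).
    d_real, d_fake = sigmoid(w * x + c), sigmoid(w * g + c)
    grad_w = np.mean(-(1 - d_real) * x) + np.mean(d_fake * g)
    grad_c = np.mean(-(1 - d_real)) + np.mean(d_fake)
    w, c = w - alpha * grad_w, c - alpha * grad_c

    # Generator step: minimize -mean(log D(G(z))) (non-saturating loss).
    d_fake = sigmoid(w * g + c)
    grad_a = np.mean(-(1 - d_fake) * w * z)
    grad_b = np.mean(-(1 - d_fake) * w)
    a, b = a - alpha * grad_a, b - alpha * grad_b

fake = a * rng.normal(0.0, 1.0, 10000) + b
print(round(fake.mean(), 1))  # should drift towards the real mean of 4
```

Even in this miniature game the fake distribution is pushed towards the real one, although a true Nash equilibrium is not guaranteed, which foreshadows the convergence problems discussed next.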
3.7 Non-convergence
Reaching this state of Nash equilibrium has been found to be very difficult and is subject to ongoing research (Radford, et al. 2015: Salimans, et al. 2016: Arjovsky, Chintala & Bottou, 2017).
3.7.1 Mode collapse
Mode collapse occurs when the generator maps many different noise inputs z to the same output, so that the generated samples cover only a few modes of the real data distribution. A common explanation of why this problem occurs is the lack of diversity in the mini-batches that are provided to the discriminator, while the real distribution has a higher level of diversity. Here, the discriminator trains on a mini-batch that is low in diversity, which implies that the generator only has to generate a limited number of diverse samples to fool the discriminator. In the next iteration of training, the discriminator receives a different low-diversity mini-batch and the generator adjusts the weights accordingly. This prevents the minimax game from converging (Goodfellow, 2016: Salimans, et al. 2016).
3.7.2 Evaluation of training
Theis, Oord, & Bethge (2015) attempt to define a measure to evaluate the approximation of
distributions. The authors conclude that there is not a single measure to evaluate the
performance of a generative model. A low cross-entropy or a high likelihood does not mean that the samples from the generator are of high quality, just as a low likelihood does not imply that the samples are of low quality (Goodfellow, 2016); mode collapse is an example of such a scenario. Theis, Oord, & Bethge (2015) conclude that currently there is no state-of-the-art
method or measure to evaluate the performance of a GAN during training.
By contrast, metrics have been developed to evaluate the generated fake samples after
training a GAN. For GANs that have the objective to specifically generate images, the Inception
Score (IS) has been developed (Salimans, et al. 2016). Often in practice, the loss function that
is being minimized is analysed during training since this value should be minimized for both
the discriminator and generator (e.g., Ledig, et al. 2016: Karras, et al. 2018). This minimization
is no guarantee for realistic samples, thus the generated fake samples should always be
compared with the real data.
3.7.3 Discrete outputs
As described in section 3.1, the generator must be differentiable. This imposes the limitation that a GAN cannot generate truly discrete data outputs (Goodfellow, 2016). Nonetheless, the generated values approach the discrete outputs asymptotically.
3.8 Developments towards a stable GAN
3.8.1 One-sided label smoothing
Consider a case where the discriminator minimizes the cross-entropy between the model and
data distribution (see section 3.2). The discriminator has a tendency to minimize its loss
function very rapidly, which leaves no gradient for the generator. One-sided label smoothing
enables the discriminator to be less confident about its predictions (Szegedy, Vanhoucke, Ioffe,
Shlens & Wojna, 2016). To accomplish this, the labels in the data set of the mini-batches are
transformed. Instead of a one indicating that a sample is real, the real labels are represented by .9, while the fake labels remain zero, hence one-sided (Salimans, et al. 2016).
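A sketch of this transformation of the mini-batch labels; following the one-sided variant of Salimans, et al. (2016), only the targets of the real samples are softened.

```python
import numpy as np

# Hard labels for a mini-batch: ones for real samples, zeros for fake samples.
real_labels = np.ones(5)
fake_labels = np.zeros(5)

smoothed_real = real_labels * 0.9  # 1 -> .9: D becomes less confident on real data
smoothed_fake = fake_labels        # one-sided: fake targets are left untouched

print(smoothed_real)
print(smoothed_fake)
```

Only the real targets move, so the discriminator is discouraged from pushing D(x) all the way to 1 without rewarding it for scoring fakes above 0.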
3.8.2 Batch normalization
Ioffe & Szegedy (2015) introduced batch normalization, which transforms the mini-batches to have a mean of zero and a standard deviation of one. The transformation occurs after every layer, so that the input data for the subsequent layer is normalized. Intuitively, as the scale of
the data decreases, the scale of the loss function decreases. Therefore, the method reduces the
dependency of the gradients on the scale of the data, allows the network to employ a higher
learning rate thus faster convergence and reduces the need for dropout. LeCun, Bottou, Orr &
Müller (1998) showed that a normalization method greatly increases the speed of training a
neural network. Radford, et al. (2015) showed that the batch normalization prevents the
generator from showing symptoms of mode collapse.
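A minimal sketch of the normalization step, omitting the learned scale and shift parameters of the full batch normalization layer:

```python
import numpy as np

def batch_norm(h, eps=1e-5):
    """Normalize a mini-batch of layer activations to (approximately) zero
    mean and unit standard deviation per feature (Ioffe & Szegedy, 2015).
    The learned gamma/beta parameters are omitted for brevity."""
    mu = h.mean(axis=0)       # per-feature mean over the mini-batch
    sigma = h.std(axis=0)     # per-feature standard deviation
    return (h - mu) / (sigma + eps)

# A mini-batch of 64 activations with an arbitrary mean and scale.
h = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(64, 2))
h_norm = batch_norm(h)
print(h_norm.mean(axis=0))  # approximately [0, 0]
print(h_norm.std(axis=0))   # approximately [1, 1]
```

Whatever scale the previous layer produced, the next layer always receives inputs on the same footing, which is the dependency reduction described above.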
3.8.3 Dropout
To prevent the discriminator from overfitting, the architecture of the GAN employs dropout in the layers. Srivastava, Hinton, Krizhevsky, Sutskever & Salakhutdinov (2014) propose the key idea of dropout: units of a neural network are randomly set to zero during training. This induces stochasticity in the network and prevents the network from overfitting. Intuitively, this is similar to ensemble methods: each time a neuron drops from the architecture, a different network arises. This leads to the creation of an ensemble of sub-networks (Goodfellow, Bengio &
Courville, 2016: page 260). These authors showed that dropout improved the performance of
neural networks dramatically. Nowadays, GANs benefit from dropout in the architecture of the
discriminator (Isola, et al. 2016).
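A minimal sketch of (inverted) dropout; the rate of .5 is the value suggested by Srivastava, et al. (2014) for hidden layers.

```python
import numpy as np

def dropout(h, rate=0.5, rng=None):
    """Inverted dropout: randomly zero out a fraction `rate` of the
    activations and rescale the survivors so the expected activation is
    unchanged (Srivastava, et al. 2014). Applied only during training."""
    rng = np.random.default_rng(0) if rng is None else rng
    mask = rng.random(h.shape) >= rate  # each unit survives with prob. 1 - rate
    return h * mask / (1.0 - rate)

h = np.ones((4, 8))  # a mini-batch of hidden activations, all equal to 1
out = dropout(h, rate=0.5)
print(out)  # roughly half the entries are 0, the surviving ones are 2
```

Each forward pass samples a different mask, so the GAN's discriminator is effectively an ensemble of thinned sub-networks, as described above.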
3.8.4 Wasserstein GAN
The Wasserstein GAN (Arjovsky, Chintala & Bottou, 2017) replaces this loss function by the Earth-Mover distance, which measures the amount of cost, referred to as “dirt”, it takes to transform the initialized distribution (e.g., Gaussian) into the real data distribution. Intuitively, the dirt is measured by multiplying the mass of the distribution by the distance it needs to travel to approximate the real data distribution (Arjovsky, Chintala & Bottou, 2017). Deriving from the formulation of the Earth-Mover distance, the loss functions of the discriminator and generator are (Arjovsky, Chintala & Bottou, 2017: Algorithm 1):
J^(D) = −𝔼_{x∼ℙ_r}[f_w(x)] + 𝔼_{z∼p(z)}[f_w(g(z))]   (9)

J^(G) = −𝔼_{z∼p(z)}[f_w(g(z))]   (10)

In contrast to equation (1), here 𝔼_{x∼ℙ_r}[f_w(x)] is the expected score that the discriminator (D), now equipped with a linear activation function, assigns to samples from the real data distribution, and 𝔼_{z∼p(z)}[f_w(g(z))] is the expected score of the generated samples, indicating to what degree G(z) is considered real. The discriminator trains by minimizing equation (9), which pushes the scores of the real samples up and those of the fake samples down, while the generator trains by minimizing equation (10). Notice the minus signs: rewriting the objectives this way allows us to minimize the losses instead of maximizing them, which allows us to use optimization algorithms such as Adam. Contrary to equation (1), the saturation effect of the sigmoid is avoided and the log function is removed from the loss functions. Due to these features, the authors refer to the discriminator as a critic instead of a detective, as the discriminator no longer determines whether the samples are real or fake, but to what degree the samples are real or fake.
The authors provide proof that when G is continuous, which is by definition the case when G is a neural network, the Wasserstein distance is guaranteed to be continuous and differentiable almost everywhere, in contrast to the Kullback-Leibler divergence. To illustrate this, imagine two uniform distributions whose supports do not overlap. No matter how large the distance between these two distributions, the Kullback-Leibler divergence is infinite and provides no usable gradient, which implies that we are unable to minimize the distance by means of gradient descent. The Wasserstein distance does not have this property, since it is defined by a differentiable, finite Earth-Mover distance (Arjovsky, Chintala & Bottou, 2017: page 4).
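Analogous to the earlier loss sketch, the critic and generator costs of Algorithm 1 in Arjovsky, Chintala & Bottou (2017) can be illustrated on hypothetical critic scores; note that the scores are unbounded, as the critic has a linear output.

```python
import numpy as np

def critic_loss(f_real, f_fake):
    """WGAN critic cost, written for minimization:
    -mean(f_w(x)) + mean(f_w(g(z))). No sigmoid and no log:
    f_w is an unbounded score, not a probability."""
    return -np.mean(f_real) + np.mean(f_fake)

def wgan_generator_loss(f_fake):
    """WGAN generator cost: -mean(f_w(g(z)))."""
    return -np.mean(f_fake)

# Hypothetical critic scores for one mini-batch: real samples score high.
f_real = np.array([3.1, 2.7, 3.4])
f_fake = np.array([-1.2, -0.8, -1.5])

print(critic_loss(f_real, f_fake))   # very negative: the critic separates well
print(wgan_generator_loss(f_fake))   # positive: G should raise its scores
```

Because neither cost saturates, the critic loss keeps carrying information about the gap between real and fake scores, which is the "meaningful loss metric" property noted above.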
A further practical advantage is that the critic D can be trained until convergence while G is still able to catch up. Intuitively, a converged D is assumed
to be optimal thus is able to give the most accurate gradient to G. Whereas, in case of the
Kullback-Leibler divergence, we needed to account for a delicate balance between D and G,
see section 3.7. Consequently, the authors empirically show that the convergence of the GAN
is more stable. During their empirical study, the authors did not find any evidence for mode
collapse during training (Arjovsky, Chintala & Bottou, 2017). Another advantage of the
Wasserstein distance is the ability to provide a meaningful loss metric that correlates with the
quality of the generated images. Nonetheless, at the time of writing this property has not been investigated for one-dimensional data generation, especially not in a marketing context.
4. Research design
4.1 Data description
The real churn data set is provided by an insurance provider in the Netherlands, whereas the market data set is provided by well-known supermarket chains in the Netherlands.
4.1.1 Customer data
The customer data in this study consists of two churn data sets. The first data set is an artificial
churn data set of 3,333 observations, which is freely available on the internet. The data set
contains the variables described in Table 1.
Table 1, artificial churn data set variables.
Variable Scale Description
Account Length Ratio The number of months a customer has a contract.
International Plan Nominal Whether the customer has an international plan.
Voicemail Plan Nominal Whether the customer has a voicemail plan.
Voicemail Message Integer The number of voicemail messages.
Day Min. Continuous The number of minutes called during the daytime.
Day Calls Integer The number of calls during the daytime.
Day Charge Continuous The amount charged during the daytime.
Eve Min. Continuous The number of minutes called during the evening.
Eve Calls Integer The number of calls during the evening.
Eve Charge Continuous The amount charged during the evening.
Night Min. Continuous The number of minutes called during the night.
Night Calls Integer The number of calls during the night.
Int. Min. Continuous The number of international minutes called.
Int. Calls Integer The number of international calls.
Int. Charge Continuous The amount charged for international calls.
Cust. Serv. Calls Integer The number of customer service calls.
Churn Nominal Whether the customer churned.
The second data set is a real churn data set from an anonymous insurance company in the
Netherlands, which consists of 1,262,423 observations. Table 2 describes the variables that are
present in the data set.
Table 2, variables in the real churn data set from an insurance provider in the Netherlands.
Variable Scale Description
Churn Nominal Whether the customer cancelled the contract.
Gender Nominal Male or female.
Age Continuous The age of the customer in years.
Rel. duration Continuous The duration of the relationship.
Collective Nominal Whether a customer is part of an insurance collective.
Size of Policy Categorical The size of the policy.
AV 2011 Categorical Additional insurance package of the customer.
Complaints Categorical The number of complaints.
Contact Integer The number of contacts the customer made.
Distance to store Categorical The distance to a store.
Address size Categorical The size of the house of a customer.
Incoming contacts Nominal Whether the customer contacted the insurance.
# incoming contacts Integer The number of incoming contacts from the customer to the insurer.
AV cancellation Nominal Whether a customer has cancelled the additional insurance.
Defaulter Nominal Whether somebody had trouble paying in the past.
Urbanity Categorical The urbanity of where the customer lives.
Social class Categorical The social class of the customer.
Stage of life Categorical The stage of life of a customer.
Income Categorical The income that a customer receives.
Education Categorical The level of education a customer received.
“BSR.groen” Nominal An additional insurance package.
“BSR.rood” Nominal An additional insurance package.
Without children Nominal Whether a customer has no children.
Payment method Nominal The payment method of a customer.
Declared Nominal Whether a customer has declared any value.
Declaration amount Continuous The amount of declarations in euros.
4.1.2 Market data
The market data set consists of lemonade sales of brands in different supermarket chains from
the Netherlands. This data set consists of 4,858 observations. Additional weather data is
collected from the KNMI and Google Trends. Table 3 presents the variables that are in the data
set.
Table 3, real market data set.
Variable Scale Description
Date Categorical The date at the time of sales.
Year Categorical The year at the time of sales.
Quarter Integer The quarter at the time of sales.
Week Integer The week at the time of sales.
Chain Categorical The supermarket chain.
Brand Categorical The brand of lemonade.
Unit Sales Integer The units of lemonade sold.
Price PU Continuous The price of the lemonade with the promotion.
BasePrice PU Continuous The price of the lemonade without promotion.
FeatDispl Integer % of stores with feature and display promotion.
DispOnly Integer % of stores with display promotion.
FeatOnly Integer % of stores with feature promotion.
Promotion Continuous % of discount.
Revenue Continuous The amount of revenue in euros.
MinTemp Continuous The minimum temperature in Celsius at De Bilt.
MaxTemp Continuous The maximum temperature in Celsius at De Bilt.
Sunshine Continuous The duration of sunshine at De Bilt.
Rain Continuous Duration of rain in .1 hour.
KarvanC. Go Continuous Google Trends index (0-100).
4.1.3 Data cleaning and missing values
Table 4, missing values and imputation methods
Variable Missing (%) Imputation method
Rel. duration 4.49 MI (decision tree)
AV 2011 23.89 MI (decision tree)
Distance to store 1.88 Listwise deletion
Urbanity .07 Listwise deletion
Social class 1.69 Listwise deletion
Income 9.78 MI (decision tree)
Education 1.69 Listwise deletion
“BSR groen” 47.02 MI (decision tree)
“BSR rood” 47.02 MI (decision tree)
Without Children 8.61 MI (decision tree)
Payment method .003 Listwise deletion
Declarations amount <.001 Listwise deletion
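As a minimal sketch of the listwise deletion applied to several variables in Table 4 (the decision-tree based multiple imputation is not sketched here, as it requires a dedicated implementation), assuming a small hypothetical data matrix with missing values:

```python
import numpy as np

# Hypothetical data matrix: two variables, four observations, two of which
# contain a missing value (NaN).
data = np.array([[34.0, 1.2],
                 [np.nan, 0.7],
                 [51.0, np.nan],
                 [45.0, 2.1]])

# Listwise deletion: drop every observation (row) that contains at least one
# missing value, keeping only the fully observed rows.
complete = data[~np.isnan(data).any(axis=1)]
print(complete)
print(len(complete))  # 2 of the 4 observations remain
```

Listwise deletion is only defensible for variables with very small missing fractions, which is why the higher-percentage variables in Table 4 are imputed instead.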