Generative adversarial networks in marketing:
Overcoming privacy concerns with the generation of fake data.
Gilian Ponte
S2591634
Master Thesis
Supervisor: Prof. dr. J. Wieringa
Second supervisor: dr. K. Dehmamy
Rijksuniversiteit Groningen
Faculty of Economics and Business
Department of Marketing
PO Box 800
9700 AV Groningen
Abstract
Privacy is a fundamental human right. Over the years, the right to privacy has come under pressure from the growth of the internet, new methodologies and the sheer volume of data. These privacy concerns have led individuals to react negatively to the collection and usage of individual data. The methods that put pressure on this fundamental right have now advanced to the point where they can alleviate these same privacy issues. The recent developments surrounding
generative networks allow the generation of individual fake data based on any real data
distribution. This study is the first empirical attempt to alleviate privacy concerns by means of
a generative adversarial network in the field of marketing. Consequently, for three data sets a
generative adversarial network and a Wasserstein generative adversarial network are
developed. These networks successfully generate fake data that is useful in marketing modeling
cases. Surprisingly, this study shows that estimations on fake data are able to outperform
estimations on real data. The current study shows that academics and firms are able to generate fake data to alleviate privacy concerns among individuals, promote data sharing and even advance the development of theory across academic disciplines.
Preface
I started my academic career at the Rijksuniversiteit Groningen as a pre-master student in
marketing. During that first year I was uncertain whether I would ever be capable of completing a master's degree. Nonetheless, I convinced myself that when you really want something, you are capable of achieving it. Two years later, this paper is the result of a very enjoyable period as an MSc Marketing Intelligence student.
During this academic year, I was able to feed my curiosity with courses such as: Data Science
& Marketing Analytics, Market Models, Digital Marketing Intelligence and Customer Models.
These courses provided me a thorough background in modeling. Specifically, during these
courses my interest was directed at the recent developments in artificial intelligence and deep
learning. Over the course of my masters, the development of generative networks gained
exponential attention within and outside the literature of artificial intelligence. I remember
asking Prof. dr. Wieringa during one of these courses whether the marketing literature has
investigated the applications of GANs in a privacy setting. Privacy is an issue, in and outside
marketing, that affects a substantial number of people. In my view, the development of privacy-protective methods has not gained adequate attention throughout the years. These observations led to the subject of my master thesis.
First of all, I would like to thank Prof. dr. Wieringa for the elaborate support during and outside
the process of writing my thesis, for example, the support in the application process for the Research Master and a possible subsequent PhD. I look forward to collaborating on research into the development of privacy-preserving methods in the future. Secondly, I would
like to thank my family for supporting me in many ways, raising me with a persevering and
curious attitude and for always believing in me. In discussions my family was always able to
shed a different light on my research. Finally, I would like to thank my friends with special
gratitude towards: Hidde Smit, Diede Wieldraaijer, Mats Neeft, Felix Lehmkule, Muthia
Khanza and Wisse Smit for all the support and fun I enjoyed from our friendship over the years.
Gilian Ponte
Table of contents
1. Introduction
1.1 Practical and academic contributions
1.2 Research question
1.3 Structure of the study
2. Theoretical framework
2.1 The ability of GANs to generate customer data
2.2 Privacy issues
2.3 Marketing modeling
3. Generative Adversarial Networks
3.1 Formal objective
3.2 Loss functions
3.3 Gradient descent and learning rate
3.4 Optimization algorithms
3.5 Activation functions
3.6 Training procedure
3.7 Non-convergence
3.8 Developments towards a stable GAN
4. Research design
4.1 Data description
4.2 Data normalization
4.3 GAN architecture
4.4 Wasserstein GAN architecture
5. Results
5.1 The correlation matrix of the fake data set correlates with the real data correlation matrix
5.3 The predictive accuracy of machine learning techniques is significantly lower on fake data than on real data
5.4 The addition of generated fake data to real data will significantly increase the predictive accuracy of machine learning techniques
5.5 The parameters are equal between the model based on generated fake data and an estimation on the real data
5.6 The effect of generated data on the MAPE, Theil U-statistic and RAE
6. Discussion
7. Conclusion and future research
7.1 Conclusion
7.2 Limitations
7.3 Future research
References
Appendices
Appendix 1: Generative adversarial networks code
Appendix 2: Analysis of fake and real data code
Appendix 3: Wasserstein GAN
1. Introduction
Marketing has a rich history of modeling data in efforts to understand customer behaviour and
the effectiveness of marketing channels. Wedel & Kannan (2016) present an outline of the
timeline of marketing data and analytics, from survey data in 1900 accompanied by OLS and
ANOVA to the development of methods for social and location data in the present (e.g., Nam
& Kannan, 2014; Büschken & Allenby, 2016). As new types of data became available, the
development of new methods naturally followed. Parallel to the emergence of new data types
and methods, the volume of data increased, which is often referred to as one of the defining characteristics of big data (Sagiroglu & Sinanc, 2013; Wedel & Kannan, 2016). The rise and popularization of the internet and the emergence of social media have been a considerable game changer, giving companies and marketing academics access to rich data sets containing detailed information on the individual activities of users (Bucklin & Sismeiro, 2009). Nevertheless,
the growing volume of data comes with a set of disadvantages. Challenges in privacy, data
sharing, storage, computation power and noise in data are a reality for every company or
academic when handling big data (Leeflang, Wieringa, Bijmolt & Pauwels, 2017).
Yann LeCun (2018), who is considered one of the founding fathers of the field of artificial intelligence, described a new and exciting methodological development: “Adversarial training is the coolest thing since sliced bread”, or “the most exciting thing in Deep Learning”. The introduction of generative adversarial networks (GANs) by Goodfellow, et al. (2014) was received with much enthusiasm by the artificial intelligence community. While marketing has a history of employing techniques from computer science and the field of artificial intelligence, generative adversarial networks have not yet been studied in a marketing context. This is quite striking, considering the potential solutions they might offer to the challenges present in the field of marketing, especially since a GAN has the potential to solve or alleviate the privacy issues present in the field.
model-based approaches for privacy protection in marketing. The current study is an attempt to
empirically explore the ability of generative adversarial network to generate customer data
while preserving customer privacy. This study shows that the generative adversarial networks
are able to generate fake data that can be used to alleviate privacy issues and enhance data
sharing among academics and practitioners.
1.1 Practical and academic contributions
In the field of marketing, firms annually spend around 36 billion dollars to capture and leverage
customer data (Columbus, 2014). The vast amount of investment in leveraging customer data
combined with the growth of data and possibilities to capture individual customer data led to
privacy concerns (van Doorn & Hoekstra, 2013). Consequently, the European Union and the
United States created legislation to regulate and protect the individual data of customers
(European Parliament, 2013; PCAST, 2014). A practical example of these developments is the recent series of events concerning Facebook and Cambridge Analytica. Cambridge Analytica was seeking a way to enrich customer data for modeling purposes. Data sharing enables companies to collaborate and enrich data sets to obtain a better customer view and enhance the performance of customer models (Peltier, et al. 2013). Therefore, Cambridge Analytica and Facebook shared the data of more than 87 million users (Reuters, 2019). The risks of such endeavours are clear: companies are exposed to potential privacy costs, losses in brand value, legal fines and losses of customer trust (Schneider, Jagpal, Gupta, Li & Yu, 2017). For Facebook, the data breach led to a decrease of more than 119 billion dollars in market value due to the exposure of the data of the 87 million Facebook users (Neate, 2018). To illustrate the
magnitude of such an event, the total damage in stock value was equal to the value of
McDonald's at the time. The scandal led to the termination of the firm Cambridge Analytica.
Nowadays, Facebook continues to suffer from the consequences of the data breach. The FTC
continues to investigate Facebook and the company may face a multibillion-dollar fine
regarding the Cambridge Analytica events (Reuters, 2019).
dimensionality. Leeflang, Wieringa, Bijmolt, & Pauwels (2017) describe machine learning
techniques that are better able to deal with these shortcomings. Therefore, when handling big data, marketing research needs to rely more on recent estimation techniques and modeling approaches from other fields, such as machine learning.
The advantages of these methods were not immediately evident among academics and
practitioners. The development of machine learning has a long history of setbacks and progress.
In 1950, the famous British mathematician Alan Turing asked whether machines would be able to think, thereby proposing the idea of a machine that is able to learn and become artificially intelligent (Turing, 1950). Only a year later, one of the first machines able to learn from data was developed, introducing a new field of research called ‘Artificial Intelligence’, often referred to as AI (McCarthy, 1974). Around the 1990s, the
discovery of backpropagation by Rumelhart, Hinton & Williams (1986), the development and
rise of the internet, increasing computational power and the vast amount of data contributed to
the increase of interest and funding into AI research. The field of AI started to show promising
results and even to surpass human performance in specific tasks. An example is the development of the Google Search Engine, which enables users to search the internet. DeepMind developed AlphaGo Zero, which defeated the world champion in the ancient Chinese game of Go (Silver, et al. 2017). Nowadays, machines are even able to defeat humans in video games (Vincent, 2018;
DeepMind, 2019). The introduction of the idea of using the architecture of the brain to create machines that are able to learn has, as we now know, had great consequences in almost every academic field.
is used for empirical studies often consists only of one field or company. As a result, empirical findings are often limited to one industry, requiring additional research to generalize the results. The fake data generated by a GAN could be shared between academics and practitioners, since it is not privacy sensitive.
However, this methodology raises potential issues. First of all, machine learning, and especially generative modeling, could be regarded as more engineering-oriented. It is a hands-on discipline in which new ideas are more often proven empirically than theoretically. In particular, the architecture of a GAN is often derived from practical experience and engineering rather than theory. Goodfellow, et al. (2014) confirm this notion by stating that a great deal of hyperparameter tuning is required to successfully generate data. Secondly, Goodfellow, et al. (2014) identify that GANs are hard to train and hard to get to converge. The successful convergence of a GAN is a subject of ongoing research. Moreover, GANs lack a single objective metric to monitor during training, which makes it challenging to identify at which state the GAN generates realistic data samples. The convergence of a GAN is therefore often assessed by subjectively inspecting the fake samples.
Finally, Wedel & Kannan (2016) raise the legitimate question of whether marketing academics should adopt machine learning at the expense of traditional methods, because machine learning techniques do not yet establish causal effects or generalizable theoretical insights. Machine learning techniques, especially neural networks, are often considered “black boxes”: their interpretability is limited, although improving it is an active stream of research (e.g., SHAP by Lundberg & Lee (2017)). More traditional methods, in contrast, are able to identify the effects of predictors on a dependent variable (e.g., logistic or linear regression).
1.2 Research question
To conclude, the goal of this study is to investigate the consequences of using fake data for marketing modeling purposes. This overall goal translates into the following sub-research questions. First, the ability of a GAN to generate data is investigated, resulting in the research question: “How does a GAN generate fake data?”. As mentioned, the creation of fake data based on the distribution of real data could mitigate privacy issues, which is summarized in the second research question: “Is a GAN able to alleviate privacy issues by the generation of fake data?”. Evidently, the generated fake data needs to perform comparably to, or outperform, the real data in marketing modeling. Therefore, the last research question is defined as: “What is the effect of fake data compared to real data on the predictive accuracy of marketing models?”
1.3 Structure of the study
The remainder of this study is organized as follows: Chapter 2 touches upon the current
literature on GANs and their implications for marketing modeling exercises. Chapter 3 provides
a detailed overview of the methodology developed in this study, with special attention for the
current state-of-the-art GAN architectures. Chapter 4 describes the processes that led up to the development of the GAN and the fake data. Chapter 5 elaborates on the results and the tested hypotheses. These results are discussed in chapter 6, and chapter 7 gives an overview of limitations and recommendations for future research.
2. Theoretical framework
2.1 The ability of GANs to generate customer data
indicate that the importance and influence of techniques from AI in the marketing literature is
unambiguous.
LeCun, Bengio & Hinton (2015) continued to develop the concept of neural networks and
introduced the field of deep learning. Deep learning encompasses neural networks with, instead
of one processing layer, multiple processing layers (i.e., a higher level of abstraction). This
development sparked a stream of research into deep neural networks that dramatically improved
the performance in the tasks of speech-recognition, visual object recognition, image
classification, topic classification, sentiment analysis, question answering, language translation and the processing of videos (Krizhevsky, Sutskever & Hinton, 2012; Bordes, Chopra & Weston, 2014; Sutskever, Vinyals & Le, 2014; LeCun, Bengio & Hinton, 2015). Recurrent neural networks, in turn, showed improvements on data with long-range dependencies, for example in the analysis of sequential data such as text and speech (Graves, Mohamed & Hinton, 2013).
The introduction of generative adversarial networks or GANs by Goodfellow, et al. (2014) has
shown promising results in the field of generating images, videos and audio data. Probably the
most recent and well-known example is the generation of celebrity faces by Karras, Aila, Laine
& Lehtinen (2018). The authors generated fake high-resolution images of celebrity faces that
are indistinguishable from real images of faces. Lotter, Kreiman & Cox (2015) show that GANs
are able to generate the next frame in a video sequence. Another application is to generate a
high-resolution image from low-resolution images (Ledig, et al. 2016). Isola, et al. (2016) show
that GANs are very creative. The authors demonstrate the ability of a GAN to fill in the colour
of sketches of images so that it corresponds to the ground truth. For example, a sketch of a handbag is filled in with a brown colour that resembles the original image.
The authors suggest that in the future, GANs could be applied to other kinds of e-commerce
tasks such as targeting, product recommendation or the simulation of future events. In a case
for medical trial data, Beaulieu-Jones, et al. (2018) describe the successful creation of
participant data using a GAN to facilitate data sharing.
Specifically, Goodfellow (2016) describes that a GAN approximates a density surface in a high-dimensional space. Intuitively, the dimensionality relates to the number of variables in a data set, where all the distributions of the variables and the relationships between variables are represented by a surface. Empirically, it is interesting to investigate whether the GAN is capable of approximating a real surface in a high-dimensional space with high accuracy. For simplicity, this study refers to the variables in a one-dimensional space. The main motivation for this choice is the curse of dimensionality described by Goodfellow, Bengio & Courville (2016): as the number of variables increases, the dimensionality of the density surface increases, which leads to several statistical challenges. For a detailed description of these challenges, this study refers to Goodfellow, Bengio & Courville (2016). Therefore, the current study empirically investigates whether the one-dimensional distributions and the correlations within the real data set are contained in the fake data set. These two conditions are required and often
serve as a proxy to measure and compare two surfaces in a high-dimensional space. Both Beaulieu-Jones, et al. (2018) and Kumar, Biswas & Sanyal (2018) confirm that the correlations among the generated data variables significantly correlate with the correlations among the real data set. To conclude, this leads to the first hypothesis:
H1: The correlation matrix from the fake data set significantly correlates with the real data
correlation matrix.
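H1 can be operationalized in a few lines. The sketch below (my own illustrative operationalization, not the thesis's exact procedure) computes the correlation matrices of a real and a fake data set and correlates their off-diagonal entries; the noisy copy standing in for GAN output and all numbers are assumptions for demonstration only.

```python
import numpy as np

def correlation_similarity(real, fake):
    """Correlate the off-diagonal entries of the two correlation matrices,
    a simple proxy for H1 (cf. Beaulieu-Jones, et al. 2018).

    real, fake: arrays of shape (n_observations, n_variables)."""
    r_real = np.corrcoef(real, rowvar=False)
    r_fake = np.corrcoef(fake, rowvar=False)
    iu = np.triu_indices_from(r_real, k=1)  # upper triangle, excluding the diagonal
    return np.corrcoef(r_real[iu], r_fake[iu])[0, 1]

rng = np.random.default_rng(0)
cov = [[1.0, 0.6, 0.2], [0.6, 1.0, 0.4], [0.2, 0.4, 1.0]]
real = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=1000)
fake = real + rng.normal(scale=0.3, size=real.shape)  # stand-in for GAN output
r = correlation_similarity(real, fake)
```

A value of r close to one would be consistent with H1; a formal test would additionally require the p-value of this correlation.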
As described, a GAN approximates the probability distributions of the real data set in a
high-dimensional space. Therefore, when a GAN converges, the generated data distributions resemble the real data distributions, which has been partially confirmed by other studies but has never been statistically tested (Goodfellow, et al. 2014; Beaulieu-Jones, et al. 2018; Kumar, Biswas & Sanyal, 2018). Therefore, a sub-hypothesis is developed as follows:
2.2 Privacy issues
The literature is conclusive regarding the effects of privacy in marketing. Customers respond negatively to a firm's collection and use of individual data (van Doorn & Hoekstra, 2013; Martin, Borah & Palmatier, 2017). Martin, Borah & Palmatier (2017) describe that a data breach leads
to significant negative stock performance and even spillover effects on the value of other (rival)
companies. However, transparency and control promises positively mediate this effect. This
makes privacy issues not only relevant for companies and marketing academics, but one of their
highest priorities (Kannan & Li, 2017: Marketing Science Institute, 2018).
To potentially deal with privacy issues, Wedel & Kannan (2016) identify two potential actions
to take: data minimization and data anonymization. Data minimization refers to limiting the amount of data marketers collect and disposing of the data when it is no longer needed. This counteracts the goal of generating generalizable results, for which rich, high-volume data is required. Data anonymization is accomplished by k-anonymization, removing personally identifiable information, recoding, swapping or randomizing data, or hashing algorithms (Reiter, 2010). Nevertheless, in case of a data breach, such data is still considered to be privacy sensitive (Miller & Tucker, 2011).
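To make the k-anonymization concept concrete, the sketch below checks the defining property of k-anonymity: every combination of quasi-identifier values must occur at least k times. The field names (age_band, zip3, churn) and the records are hypothetical examples of my own, not data from this study.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    occurs at least k times -- the defining property of k-anonymity."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in combos.values())

# Hypothetical generalized customer records.
customers = [
    {"age_band": "20-30", "zip3": "970", "churn": 1},
    {"age_band": "20-30", "zip3": "970", "churn": 0},
    {"age_band": "30-40", "zip3": "970", "churn": 0},
]
ok = is_k_anonymous(customers, ["age_band", "zip3"], k=2)  # one group has size 1
```

In this toy example the single 30-40 record makes the table fail 2-anonymity, illustrating why anonymized data can remain re-identifiable.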
Wedel & Kannan (2016) propose that marketing analytics should develop procedures to balance minimization and anonymization against the resulting degradation of diagnostic and predictive power. A GAN could be considered the perfect example of a data anonymization technique that does not minimize the data. Instead of minimizing, it generates data based on the real data distribution, which leads, in theory, to very useful data for modeling practices (Goodfellow, et al. 2014; Wieringa, et al. 2019). Moreover, since the data is created based on a real data distribution, it is not possible to trace individual real customers back from the fake data in case of a data breach.
an individual level while ensuring individual privacy protection. Instead of focussing on the
methodology to mitigate privacy issues in the study of Holtrop, et al. (2017), this study focusses
on the generation of fake data with a GAN to preserve the privacy of consumers.
Schneider, Jagpal, Gupta, Li & Yu (2018) describe that the generation of fake data provides an
important advantage over other privacy protection measures. Namely, the generation allows
theoretical guarantees of privacy (e.g., differential privacy). Beaulieu-Jones, et al. (2018)
employ differential privacy in the architecture of their GAN, which allows the authors to control
for the privacy of the participants by adding a small amount of random noise to the weights of
a GAN. A more technical description of differential privacy is available in Abadi, et al. (2016).
Naturally, the ability of GANs to mitigate privacy issues is conditional on the performance of
the fake data or combinations of fake and real data in marketing modeling practices.
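The core per-step mechanism behind the differentially private training of Abadi, et al. (2016) and Beaulieu-Jones, et al. (2018) can be sketched as follows: clip each gradient to a maximum norm and add calibrated Gaussian noise. This is an illustrative sketch only; the function name and parameter values are my own assumptions, not the authors' implementation.

```python
import numpy as np

def privatize_gradient(grad, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Clip a gradient to a maximum L2 norm, then add Gaussian noise.

    This mimics the per-example step of differentially private SGD
    (after Abadi, et al. 2016); parameter values are illustrative."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip_norm / (norm + 1e-12))  # scale down if too large
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=grad.shape)
    return clipped + noise

g = np.array([3.0, 4.0])           # L2 norm 5, will be clipped to norm 1
g_private = privatize_gradient(g)
```

The noise scale, together with the clipping bound and the number of training steps, determines the formal privacy budget; in practice this bookkeeping is done by a so-called moments accountant.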
2.3 Marketing modeling
Several empirical studies in the subfields of marketing literature report the limitation of having
limited data. The literature ranges from churn prediction (Neslin, et al. 2006; Holtrop, et al. 2017) and attribution modeling (Anderl, Becker, von Wangenheim & Schumann, 2016) to modeling markets (van Heerde, Gijsenberg, Dekimpe & Steenkamp, 2013). In particular, the performance, staying power and architecture of churn models have been heavily studied in the marketing literature (Ha, Cho & Maclachlan, 2005; Lemmens & Croux, 2006; Neslin, et al. 2006; Risselada, Verhoef & Bijmolt, 2010; Ascarza & Hardie, 2013; Holtrop, et al. 2017).
Risselada, Verhoef & Bijmolt (2010) describe that the predictive accuracy of churn models heavily depends on the volume and nature of the data set (cf. the no-free-lunch theorem). The authors describe that a logistic regression outperforms other methods when the volume of data is fairly limited. However, for larger data sets (n ≈ 1000), neural networks, tree methods and ensemble methods outperform other methods (Perlich, Provost & Simonoff, 2004). This implies that adding generated fake data to the real data most likely has the highest impact on the predictive accuracy of neural networks, tree methods and ensemble methods.
while preserving the individual privacy of medical patients. However, this difference has not been tested for significance, nor in a marketing context, and only a limited number of machine learning methods were employed. This leads to the second hypothesis:
H2: The predictive accuracy of machine learning techniques is significantly lower on fake
data than on real data.
Complementary to Beaulieu-Jones, et al. (2018), this study investigates whether the addition of
the generated fake data, to the real data, increases the predictive accuracy of the machine
learning techniques in a churn context. From an artificial intelligence perspective, Goodfellow, Shlens & Szegedy (2014) use fake data points, also called “adversarial examples”, to increase the performance of predictive models. The goal is to make the algorithm robust by letting it learn from adversarial examples that are extremely similar to the real samples, so that it becomes able to separate them with high confidence.
From a marketing perspective, Risselada, Verhoef & Bijmolt (2010) describe how the
performance of machine learning techniques heavily depends on the nature and amount of data
in a case for predicting churn. Also, as the number of observations in a data set increases, the
neural networks, tree and ensemble methods start outperforming the other methods (Perlich,
Provost & Simonoff, 2004). Leeflang, Wieringa, Bijmolt & Pauwels (2015) elaborate on this
idea by describing how in general with more variability in data samples, the predictor variables
are estimated with more precision. This relates to the property of a GAN to be able to generate
data with high stochasticity and variability within, or sometimes even outside (see Figure 7 or
9), the range of the real data distribution (Goodfellow, et al. 2014). Therefore, when adding
generated samples from a GAN to the real data, the predictor variables are estimated with more
precision and the predictive accuracy is expected to increase.
This leads to the following sub-hypothesis:
H2a: The addition of generated fake data to real data significantly increases the predictive
accuracy of machine learning techniques compared to only real data.
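The comparison behind H2a can be sketched on toy data: train the same classifier once on a small "real" set and once on the real set augmented with generated data, and compare test accuracy. Everything below (the sampling function, the deliberately simple nearest-mean classifier, and the sample sizes) is an illustrative stand-in of my own, not the thesis's churn data or models.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Toy churn-like data: class 1 is shifted by +1 on both predictors."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + y[:, None]
    return X, y

def nearest_mean_accuracy(X_train, y_train, X_test, y_test):
    """Classify each test point by the nearer class mean."""
    m0 = X_train[y_train == 0].mean(axis=0)
    m1 = X_train[y_train == 1].mean(axis=0)
    pred = (np.linalg.norm(X_test - m1, axis=1)
            < np.linalg.norm(X_test - m0, axis=1)).astype(int)
    return (pred == y_test).mean()

X_real, y_real = sample(50)        # small "real" training set
X_fake, y_fake = sample(5000)      # stand-in for GAN-generated data
X_test, y_test = sample(5000)

acc_real = nearest_mean_accuracy(X_real, y_real, X_test, y_test)
acc_aug = nearest_mean_accuracy(np.vstack([X_real, X_fake]),
                                np.concatenate([y_real, y_fake]),
                                X_test, y_test)
```

Here the fake data comes from the same distribution as the real data by construction; the open empirical question in H2a is precisely whether GAN output is close enough to the real distribution for such augmentation to help.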
data is that GANs are only able to generate data on variables that are readily available. Every introductory course on statistics tells us that having more data reduces uncertainty (i.e., the central limit theorem). Therefore, the variance, or standard deviation, of the parameter estimates decreases, which results in high t-values (e.g., in OLS). In theory, when the amount of data is increased indefinitely, every parameter will have a significant effect on the dependent variable.
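The shrinking standard errors can be illustrated with a small OLS simulation: the standard error of a slope scales roughly with 1/√n, so its t-value grows with the sample size even when the true effect is small. The setup below (a no-intercept regression with a true slope of 0.1) is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def ols_t_value(n, beta=0.1, sigma=1.0):
    """Simulate y = beta*x + noise and return the t-value of the OLS slope."""
    x = rng.normal(size=n)
    y = beta * x + rng.normal(scale=sigma, size=n)
    b = np.sum(x * y) / np.sum(x * x)                 # OLS slope (no intercept)
    resid = y - b * x
    se = np.sqrt(np.sum(resid**2) / (n - 1) / np.sum(x * x))
    return b / se

t_small = ols_t_value(100)       # t roughly around 1: not significant
t_large = ols_t_value(100_000)   # t roughly around 30: highly significant
```

The same weak effect thus becomes "significant" merely by adding observations, which is why generating arbitrary amounts of fake data complicates classical significance testing.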
As described, the generated samples consist of high variability and stochasticity in terms of the
values of the variables (Goodfellow, et al. 2014). Empirically, Beaulieu-Jones, et al. (2018) confirmed that the generated fake data exhibits a higher variability than the real data, due to the addition of noise to the weights. Therefore, when increasing the amount of data used for estimation, the variance in the estimation on fake data is expected to differ from the variance in the estimation on real data. This leads to the development of the third
hypothesis:
H3: The variance of parameters is different in the estimation on generated fake data,
compared to an estimation on the real data.
The question remains whether the parameters, defined as β, in the models change when using only the fake data. At the time of writing, there are no studies that compare the effects in a market model estimated on generated fake data from a GAN with those of a model estimated on real data. In case of successful convergence of the GAN, the fake data resembles the real data distribution in a high-dimensional space. Therefore, the relationships between the variables are preserved and the effects of the parameters in both models are expected to be similar (Goodfellow, et al. 2014). This leads to the development of the following sub-hypothesis:
H3a: The parameters are equal between the estimation based on generated fake data and an
estimation on the real data.
in the fake estimation is equal to that in the real estimation. However, according to Beaulieu-Jones, et al. (2018), the variance is higher in the fake estimation due to the noise added to the weights of a GAN to account for the privacy of the individual. Assuming that the parameters are the same while the variance differs, the following sub-hypothesis is developed:
H3b: The t-values of parameters differ between the estimation on generated fake data,
compared to an estimation based on real data.
The fact that the generated fake data affects the level of uncertainty and parameters of an OLS
estimation, makes it worthwhile to investigate the effects on the predictive validity between the
model estimated on generated fake data and real data. The predictive validity measures of a
model provide information on the predictive power of the estimation. When these measures are
comparable between the two estimations, this would promote and enable data sharing and
mitigate privacy concerns in marketing. A predictive validity measure that is dimensionless across models is the Mean Absolute Percentage Error (MAPE). The MAPE thus allows for a comparison of the predictive validity of two estimations regardless of the scale of the dependent variable. Therefore, this study uses the MAPE to compare predictive performance. There is no
literature available to develop a hypothesis, thus the following empirical issue is posed:
How does the generated fake data influence the MAPE, compared to the estimation on the
real data?
Moreover, it is interesting to examine whether the model outperforms a naïve model that takes the value of the dependent variable in period t−1 as the prediction for the next period, t. The relevant measures are the Relative Absolute Error (RAE) and Theil's U-statistic. When the outcome of these statistics is
less than one, the model outperforms a naïve model (Leeflang, Wieringa, Bijmolt & Pauwels,
2015). This scenario would imply that estimation on fake data is worthwhile. Therefore, the
following empirical issue is formulated:
How does the generated fake data influence the RAE and Theil U-statistic, compared to the
estimation on the real data?
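The three predictive validity measures can be computed as follows. The definitions below follow common usage (cf. Leeflang, Wieringa, Bijmolt & Pauwels, 2015): the RAE and Theil's U compare the model's errors against a naïve last-value forecast. The numeric example is my own illustration, not data from this study.

```python
import numpy as np

def mape(actual, predicted):
    """Mean Absolute Percentage Error (in %)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return 100 * np.mean(np.abs((actual - predicted) / actual))

def rae(actual, predicted):
    """Relative Absolute Error versus a naive forecast (y_{t-1} predicts y_t)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    naive = actual[:-1]
    return (np.sum(np.abs(actual[1:] - predicted[1:]))
            / np.sum(np.abs(actual[1:] - naive)))

def theil_u(actual, predicted):
    """Theil's U-statistic: ratio of the model's RMSE to the naive RMSE."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    naive = actual[:-1]
    return (np.sqrt(np.mean((actual[1:] - predicted[1:]) ** 2))
            / np.sqrt(np.mean((actual[1:] - naive) ** 2)))

y = np.array([100.0, 110.0, 120.0, 130.0])      # illustrative actuals
yhat = np.array([102.0, 108.0, 121.0, 128.0])   # illustrative predictions
```

For this example the RAE and Theil's U are both well below one, the threshold at which a model is said to beat the naïve benchmark.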
3. Generative Adversarial Networks
Goodfellow (2016) describes a GAN by the idea of a game between two players. The first player is a generator that creates samples intended to represent the real training data distribution. The second player is a discriminator that tries to distinguish between the fake and real data. This game is best illustrated as a competition between a counterfeiter trying to create fake money that passes as real, and the discriminator acting as the police trying to detect the counterfeit money (see Figure 1). The goal of the game is for the generator to create samples good enough to fool the discriminator. When the generator succeeds, the distribution of its samples resembles that of the real money.
3.1 Formal objective
More formally, both players are functions, represented by deep neural networks, that are differentiable in their parameters and inputs. GANs with the objective of generating images are often referred to as DCGANs, introduced by Radford, et al. (2015), which stands for deep convolutional generative adversarial networks. Convolutional neural networks allow for the
preservation of the correlation among variables in a data set. These networks are regarded as
state-of-the-art for GAN implementations since the publication of Radford, et al. (2015). For a
technical review of convolutional neural networks see Schmidhuber (2015) or Goodfellow,
Bengio & Courville (2016). For an extensive review on artificial neural networks see Leeflang,
Wieringa, Bijmolt & Pauwels (2017).
Goodfellow, et al. (2014) define the discriminator as a function D that takes x as input, where x is drawn from the real or the fake data distribution, and outputs the probability, in the range (0, 1), that the sample is real. The generator is defined as a function G that takes z as input, where z is drawn from a Gaussian distribution of the same dimensionality as the real data distribution, and outputs a fake sample of data points in the range (-1, 1) in the case of a hyperbolic tangent activation function. These functions, or players, compete in a minimax game; see equation (1).
min_G max_D V(D, G) = 𝔼_{x∼p_data(x)}[log D(x)] + 𝔼_{z∼p(z)}[log(1 − D(G(z)))]   (1)
𝔼_{x∼p_data(x)} refers to an expectation over samples from the probability distribution of the real data set and 𝔼_{z∼p(z)} refers to an expectation over samples from the noise distribution fed to the generator; D(x) is the probability of samples evaluated by D to be real. Specifically, the objective is to maximize the discriminator D over the real training data x, while minimizing over the generator G (i.e., D(x) ≈ 1 and 1 − D(G(z)) ≈ 0). This procedure results in the generator creating a data distribution that approximates the real data distribution x. Therefore, the generator generates realistic fake data samples. A simplistic representation of the architecture of a GAN, in which the generator (G) transforms noise into fake samples and the discriminator (D) labels samples from the data set as real or fake, is visible in Figure 1.
3.2 Loss functions
A loss function is a function that the neural network attempts to minimize. The correct loss function depends heavily on the dependent variable. For example, in the case of a continuous dependent variable, the network could minimize the mean squared error or root mean squared error (MSE & RMSE). In the case of a classification problem, the loss function is the cross-entropy. Minimizing the cross-entropy is equal to maximizing the log-likelihood, or minimizing the negative log-likelihood. To illustrate this concept, the cross-entropy is defined as the difference between an empirical probability distribution and the probability distribution defined by a model, denoted as p_model(x). For example, in case of classification, adjusting the weights (θ) of a discriminator by means of a maximum likelihood procedure creates a latent probability distribution which approximates the empirical data distribution (for proof consider Murphy, 2012: Goodfellow, Bengio & Courville, 2016).
This distance between the empirical data distribution p̂_data(x) and p_model(x) is defined as the negative log-likelihood, as in equation (2). Here, taking the log prevents the loss function from saturating when the likelihood of a function is very close to zero. To illustrate why taking the log is important, consider a scenario where the likelihood approaches zero due to taking the product of the probabilities of multiple examples in the maximum likelihood procedure. Here, the gradient of the loss function is likely to be very small, which results in very small updates and slows down training or convergence of the GAN (Goodfellow, 2016). Therefore, taking the log yields a function that increases everywhere, which results in a more stable function and allows faster convergence.
−𝔼_{x∼p̂_data}[log p_model(x)]   (2)
Therefore, Goodfellow, Bengio & Courville (2016) describe that one way of interpreting maximum likelihood estimation is to view it as minimizing the difference between the empirical distribution and the model distribution, also referred to as the Kullback-Leibler divergence. This cross-entropy can, for example, be defined between two Bernoulli, softmax or Gaussian distributions.
3.2.1 Discriminator
Consider the discriminator in Figure 1: the neural network tries to determine whether a sample is fake or real by means of a sigmoidal activation function (see section 3.5.1). The network attempts to minimize the cross-entropy between the real data distribution and the latent distribution created by the discriminator, by adjusting the weights through mini-batch stochastic gradient descent (section 3.3 introduces this method). The minimization of the cross-entropy leads to matching the model's latent distribution to the real data distribution, see Goodfellow, Bengio & Courville (2016) for more details. The formal cost that the discriminator minimizes, its cross-entropy, is displayed in equation (3) (Goodfellow, 2016).
J^(D)(θ^(D), θ^(G)) = −(1/m) 𝔼_{x∼p_data(x)}[log D(x)] − (1/m) 𝔼_{z∼p(z)}[log(1 − D(G(z)))]   (3)
3.2.2 Generator
From equations (1) and (3) we derive that the generator G attempts to minimize log(1 − D(G(z))). In other words, it strives to make the discriminator D believe that the generated samples G(z) are real. Goodfellow, et al. (2014, 2016) argue that training such a network in practice is not ideal. In the initial phase of training, the discriminator minimizes the cross-entropy on a combination of real and fake samples. As a result, the generator receives almost no gradient to minimize the same cross-entropy (i.e., vanishing gradients). Therefore, Goodfellow, et al. (2014, 2016) propose that the generator G should maximize log(D(G(z))) instead of minimizing log(1 − D(G(z))). This gives the generator a more stable gradient (see Goodfellow, et al. 2016 for details). To illustrate this problem: in the initial phase of training, G generates very poor samples, since it just samples from random noise, thus the discriminator is very accurate in separating the real samples from the fake samples. In this situation, D predicts all the classes correctly, D(G(z)) is close to zero, and log(1 − D(G(z))) is close to zero as well, so G has almost no gradient to learn from and is unable to generate realistic samples. By maximizing log(D(G(z))), the gradient of the loss function is less likely to saturate in a situation where D is highly confident. This leads to the formal cross-entropy cost or negative log-likelihood of the generator, where again we apply the log function to prevent the gradient from saturating:
J^(G) = −(1/m) 𝔼_z[log D(G(z))]   (4)
Here, G has the objective to minimize this cost, i.e., to maximize log D(G(z)), over mini-batches of size m. Both the loss function of G and that of D are heavily adapted throughout the literature on GANs (e.g., Martinez & Kamalu, 2018: Kumar, Biswas & Sanyal, 2018).
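As a concrete sketch, equations (3) and (4) translate almost directly into code. The snippet below is a minimal illustration in which the discriminator outputs for one mini-batch are hypothetical values, not the output of a trained network; the mini-batch average plays the role of the expectations.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Cross-entropy cost of the discriminator, equation (3):
    -mean(log D(x)) - mean(log(1 - D(G(z))))."""
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator cost, equation (4): -mean(log D(G(z)))."""
    return -np.mean(np.log(d_fake))

# Hypothetical discriminator outputs for a mini-batch of m = 4 samples.
d_real = np.array([0.9, 0.8, 0.95, 0.85])  # D(x): real samples scored close to 1
d_fake = np.array([0.1, 0.2, 0.05, 0.15])  # D(G(z)): fake samples scored close to 0

print(discriminator_loss(d_real, d_fake))  # small: D separates real from fake well
print(generator_loss(d_fake))              # large: G is still fooling nobody
```

Note how a confident discriminator produces a small cost for itself but a large cost for the generator, which is exactly the gradient signal G trains on.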
3.3 Gradient descent and learning rate
Neural networks learn their weights through back-propagation (Rumelhart, Hinton & Williams, 1986: Bishop, 2006: Goodfellow, Bengio & Courville, 2016). Back-propagation aims to derive weights in each layer of the network that ensure that, for a specific input vector, the output produced by the function is the same as or close to the desired output. The difference between the
actual output vector and desired output vector is minimized by taking the partial derivative
(gradient) of this error with respect to each weight in the network (Rumelhart, Hinton &
Williams, 1986). The concept of a gradient of a loss function is depicted in Figure 2. The
weights of the network are updated taking a step in the opposite direction of the sign of the
gradient. When the gradient is positive the weights are tuned more negatively and vice versa to
reach the global minimum, see equation (5).
In this example, the gradient is negative, thus the weights are updated in a positive direction.
This step is often referred to as the learning rate α (Ruder, 2016: Goodfellow, Bengio &
Courville, 2016).
x′ = x − α ∇_x f(x)   (5)
Here, x represents the weight, α represents the learning rate, ∇_x f(x) the gradient of the loss function and x′ the updated weight. Notice that the direction of the step is opposite to the sign of the gradient, as the minus sign flips it. In Figure 2
the gradient is negative, which results in an increase of the weights. Intuitively, it is important
to pick a reasonable value for this step. If the learning rate is too small, finding the global or
local minimum takes ample training iterations. If the learning rate is too large, the optimizer
never finds a local or global minimum of the loss function. In practice for deep neural networks,
it is not given to arrive at a global minimum and often a local minimum or even a saddle point
is found. Choromanska, Henaff, Mathieu, Arous, LeCun (2014) show that when increasing the
18
number of hidden layers, getting stuck in a local minima is less of an issue as the performance
of the network does not differ much from when a global minimum is found. Intuitively, one
could imagine that when restricting the network to find the global minimum, the network is
very likely to overfit.
Ruder (2016) describes that stochastic mini-batch gradient descent (SGD) based training
algorithms are applied to find the global minimum of a loss function. Here, stochastic refers to
the fact that the data used to minimize the loss function is drawn randomly. Mini-batch signifies that each update uses a random sample of, typically, 50 to 256 observations from the original data set. Goodfellow, Bengio & Courville (2016) describe that, because the weights are updated on a stochastic mini-batch of data, the loss function differs each time a mini-batch is sampled. This iterative process drives the loss function to a
minimum. Therefore, it is not always required to arrive at a global minimum due to the iterative
nature of the stochastic gradient procedure. In addition, it would be computationally too
expensive to restrict an optimizer to only satisfy for a global minimum, especially when the
loss function is represented in a high-dimensional space.
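The update rule of equation (5) combined with mini-batch sampling can be sketched on a toy problem. The snippet below fits the single weight of a linear model with mini-batch SGD; the data-generating process, learning rate and mini-batch size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3x + noise; we learn the single weight w with mini-batch SGD.
X = rng.normal(size=1000)
y = 3.0 * X + rng.normal(scale=0.1, size=1000)

w, alpha, batch_size = 0.0, 0.05, 64  # initial weight, learning rate, mini-batch size
for step in range(200):
    idx = rng.integers(0, len(X), size=batch_size)  # stochastic: random mini-batch
    xb, yb = X[idx], y[idx]
    grad = np.mean(2 * (w * xb - yb) * xb)          # gradient of the MSE loss w.r.t. w
    w = w - alpha * grad                            # equation (5): step against the gradient

print(round(w, 2))  # close to the true weight of 3.0
```

Because every mini-batch gives a slightly different gradient, the weight fluctuates around the minimum rather than landing on it exactly, which is the behaviour described above.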
3.4 Optimization algorithms
The question that remains from the previous paragraph is: “What is a good learning rate for my
loss function?”. Naturally, a frequently occurring answer would be: “It depends”. Goodfellow,
Bengio & Courville (2016) recognize the importance and difficulty of finding the correct value
for this hyperparameter, since it has a significant effect on model performance. The authors
describe several adaptive optimization algorithms: AdaGrad, RMSprop and Adam. These
methods all have in common that the learning rate is adapted to the value of the gradient: the larger the gradient, the smaller the learning rate for the next step. Intuitively this makes sense: a large gradient indicates a steep slope, where a large step would risk overshooting the global or local minimum of a function, see Figure 2.
In addition, adaptive optimizers are less sensitive to the choice of hyperparameters in the neural network, compared to non-adaptive SGD
algorithms. Here, hyperparameters refer to the overall architecture of the neural network (e.g.,
the number of layers, the activation function, loss function, dropout, amount of regularization).
Therefore, these optimizers are preferred over the non-adaptive SGD optimizers in neural
networks or GANs (Beaulieu, et al., 2018: Karras, et al. 2018: Kumar, Biswas & Sanyal, 2018).
Due to these arguments, the generator and discriminator use the Adam optimization algorithm
in the architecture to minimize the loss function (see section 4.3.1).
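As a sketch of how Adam adapts the step size, the snippet below implements a single Adam update following Kingma & Ba (2014) and applies it to a simple quadratic loss; the default hyperparameter values are those suggested by the authors, while the loss itself is an illustrative assumption.

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2014): the effective step adapts to
    running estimates of the gradient's mean (m) and uncentered variance (v)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2 (gradient 2w) starting from w = 5.
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 3001):
    w, m, v = adam_step(w, 2 * w, m, v, t, alpha=0.01)
print(round(w, 3))  # close to the minimum at 0
```

The division by √v̂ is what makes the method adaptive: consistently large gradients inflate v and shrink the effective step, without the user retuning α.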
3.5 Activation functions
An activation function transforms the weighted, summed inputs of the network into an output, for example a probability (Leeflang, Wieringa, Bijmolt & Pauwels, 2017). Therefore, the final activation function is usually referred to as the output layer of a neural network (Goodfellow, Bengio & Courville, 2016). The design of these functions is an active field of research and does not enjoy a strong theoretical foundation yet.
3.5.1 Discriminator
The objective of the discriminator is to distinguish the fake from the real samples. A popular activation function in GANs for the discriminator is the sigmoid or logistic function (Goodfellow, Bengio & Courville, 2016).

σ(x) = 1 / (1 + e^(−x))   (6)
3.5.2 Generator
Goodfellow (2016) describes that the generator has very few restrictions on its design. In practice, the output layer of the generator generally consists of the hyperbolic tangent or tanh activation function, see equation (7). The usage of this activation function is practically rather than theoretically motivated (e.g., Karras, et al. 2018: Kumar, Biswas & Sanyal, 2018). Kumar, Biswas & Sanyal (2018) and Karras, et al. (2018) employ data normalization so that the real and generated data are in a range of (−1, 1). This property makes the tanh function most appropriate for the output layer, since tanh(z) ∈ (−1, 1).

tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))   (7)
3.5.3 Rectified linear unit
The main disadvantage of the previously described activation functions is saturation. This makes gradient-based learning for the hidden layers very difficult. For this reason, activation functions that suffer from saturation are excluded from the hidden layers in the architecture of a neural network. Nair & Hinton (2010) introduced rectified linear units (ReLU) as a method for hidden layers that does not saturate for large positive input values, see equation (8).

y_i = x_i if x_i ≥ 0, and y_i = 0 if x_i < 0   (8)
Where x_i is the input value and y_i the output. The ReLU does not saturate for positive inputs, which keeps the gradient informative in the hidden layers. However, when the input is negative, the unit outputs zero and receives no gradient, so it can remain permanently inactive.
3.5.4 Leaky ReLU
Maas, Hannun & Ng (2013) proposed Leaky Rectified Linear Units to overcome this problem by expanding the range of the rectified linear unit, see Figure 3:

y_i = x_i if x_i ≥ 0, and y_i = a_i x_i if x_i < 0

Where a_i is a hyperparameter in the range of (0, +∞). This property allows for a small gradient when the unit is not active (i.e., x_i < 0). Xu, Wang, Chen & Li (2015) propose a value of 5.5 for a_i, while Radford, et al. (2015) propose a value of .2. This hyperparameter gives the neuron the opportunity to recover from the inactive status. As visible in Figure 3, the Leaky ReLU does not suffer from the inactivity problem, as it multiplies negative input values by the defined hyperparameter. Therefore, the Leaky ReLU is a more effective learning function in the hidden layers of a neural network (Xu, Wang, Chen & Li, 2015) and is regarded as the standard for the architecture of state-of-the-art GANs (e.g., Ledig, et al. 2016 and Karras, et al. 2018).
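The activation functions discussed above can be sketched as follows; the input values are arbitrary and only serve to contrast the behaviour of the functions for negative inputs.

```python
import numpy as np

def sigmoid(x):            # equation (6): discriminator output layer, range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):               # equation (7): generator output layer, range (-1, 1)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):               # equation (8): hidden layers, zero for negative inputs
    return np.where(x >= 0, x, 0.0)

def leaky_relu(x, a=0.2):  # hidden layers; a = .2 as in Radford, et al. (2015)
    return np.where(x >= 0, x, a * x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))     # saturates towards 0 and 1 for extreme inputs
print(relu(x))        # negative units are fully inactive
print(leaky_relu(x))  # negative units keep a small, non-zero output
```

The contrast between the last two lines is the point of section 3.5.4: the leaky variant leaves a gradient path through inactive units.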
3.6 Training procedure
The GAN starts training by combining a randomly generated mini-batch of size m of noise z, where z is drawn from a normal distribution (z ∼ N(0,1)), with a mini-batch of the real data x, see Algorithm 1 adapted from Goodfellow, et al. (2014). The combined batch of data is used
to train the discriminator D with the objective to identify the fake samples. To reach their
objective, the discriminator and generator attempt to minimize the cross-entropy (Goodfellow,
2016). To minimize the cross-entropy, the two networks (D and G) simultaneously apply
mini-batch stochastic gradient descent (SGD) in the function space in an attempt to find a global
minimum or local minimum (see section 3.3). Goodfellow (2016) recommends the usage of the
gradient-based optimization algorithm Adam (Kingma & Ba, 2014). This procedure allows the
discriminator to optimize weights, improve the prediction accuracy and the generator to
construct more realistic samples. If both models have sufficient capacity, the competition
between the two networks converges when Nash equilibrium is accomplished.
Nash equilibrium is defined as a state where, two players do not gain from deviating from their
strategies (Nash, 1950). For example, in a game between the discriminator and generator, Nash
equilibrium is reached when both networks do not gain much from adjusting the weights to
minimize the loss function. In this state, the generator produces highly realistic samples that
make the discriminator unable to separate the real x from the fake samples z (i.e., D(x) = .5).
In practice, the gradient step for the generator is usually performed on log D(G(z)) instead of on log(1 − D(G(z))) (see section 3.2.2).
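The alternating training procedure can be sketched on a deliberately tiny example. In the snippet below the generator is a linear function G(z) = az + b, the discriminator a logistic regression D(x) = σ(wx + c), the real data a one-dimensional Gaussian, and the gradients are derived by hand; all of these are illustrative assumptions, not the deep architectures used in this study. The generator uses the non-saturating loss of section 3.2.2.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

a, b = 1.0, 0.0       # generator parameters: G(z) = a*z + b
w, c = 0.1, 0.0       # discriminator parameters: D(x) = sigmoid(w*x + c)
alpha, m = 0.05, 128  # learning rate and mini-batch size

for step in range(2000):
    x = rng.normal(4.0, 1.0, m)   # mini-batch of real data
    z = rng.normal(0.0, 1.0, m)   # mini-batch of noise
    g = a * z + b                 # fake samples G(z)

    # Discriminator step: minimize -mean(log D(x)) - mean(log(1 - D(G(z)))).
    d_real, d_fake = sigmoid(w * x + c), sigmoid(w * g + c)
    grad_w = np.mean(-(1 - d_real) * x) + np.mean(d_fake * g)
    grad_c = np.mean(-(1 - d_real)) + np.mean(d_fake)
    w, c = w - alpha * grad_w, c - alpha * grad_c

    # Generator step: minimize -mean(log D(G(z))) (non-saturating loss).
    d_fake = sigmoid(w * g + c)
    grad_a = np.mean(-(1 - d_fake) * w * z)
    grad_b = np.mean(-(1 - d_fake) * w)
    a, b = a - alpha * grad_a, b - alpha * grad_b

fake = a * rng.normal(0.0, 1.0, 10000) + b
print(round(fake.mean(), 1))  # should drift towards the real mean of 4
```

Even in this miniature game the fake distribution is pushed towards the real one, although a true Nash equilibrium is not guaranteed, which foreshadows the convergence problems discussed next.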
3.7 Non-convergence
Reaching this state of Nash equilibrium has been found to be very difficult and is subject to ongoing research (Radford, et al. 2015: Salimans, et al. 2016: Arjovsky, Chintala & Bottou, 2017).
3.7.1 Mode collapse
Mode collapse occurs when the generator maps many different noise inputs z to the same output, so that the generated samples cover only a few modes of the real data distribution. A common explanation of why this problem occurs is the lack of diversity in the mini-batches that are provided to the discriminator, while the real distribution has a higher level of diversity. Here, the discriminator trains on a mini-batch that is low in diversity, which implies that the generator only has to generate a limited number of diverse samples to fool the discriminator. In the next iteration of training, the discriminator receives a different low-diversity mini-batch and the generator adjusts the weights accordingly. This prevents the minimax game from converging (Goodfellow, 2016: Salimans, et al. 2016).
3.7.2 Evaluation of training
Theis, Oord, & Bethge (2015) attempt to define a measure to evaluate the approximation of
distributions. The authors conclude that there is not a single measure to evaluate the
performance of a generative model. A low cross-entropy or a high likelihood does not mean that the samples from the generator are of high quality, just as a low likelihood does not imply that the samples are of low quality (Goodfellow, 2016); mode collapse is an example of such a scenario. Theis, Oord, & Bethge (2015) conclude that currently there is no state-of-the-art
method or measure to evaluate the performance of a GAN during training.
By contrast, metrics have been developed to evaluate the generated fake samples after
training a GAN. For GANs that have the objective to specifically generate images, the Inception
Score (IS) has been developed (Salimans, et al. 2016). Often in practice, the loss function that
is being minimized is analysed during training since this value should be minimized for both
the discriminator and generator (e.g., Ledig, et al. 2016: Karras, et al. 2018). This minimization
is no guarantee for realistic samples, thus the generated fake samples should always be
compared with the real data.
3.7.3 Discrete outputs
As described in section 3.1, the generator must be differentiable. This imposes the limitation that a GAN cannot generate truly discrete data outputs (Goodfellow, 2016). Nonetheless, the generated values approach the discrete outputs asymptotically.
3.8 Developments towards a stable GAN
3.8.1 One-sided label smoothing
Consider a case where the discriminator minimizes the cross-entropy between the model and
data distribution (see section 3.2). The discriminator has a tendency to minimize its loss
function very rapidly, which leaves no gradient for the generator. One-sided label smoothing
enables the discriminator to be less confident about its predictions (Szegedy, Vanhoucke, Ioffe,
Shlens & Wojna, 2016). To accomplish this, the labels in the data set of the mini-batches are
transformed. Instead of a one indicating that a sample is real, the real labels are represented by .9, while the fake labels remain zero, hence one-sided (Salimans, et al. 2016).
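A sketch of this transformation of the mini-batch labels; following the one-sided variant of Salimans, et al. (2016), only the targets of the real samples are softened.

```python
import numpy as np

# Hard labels for a mini-batch: ones for real samples, zeros for fake samples.
real_labels = np.ones(5)
fake_labels = np.zeros(5)

smoothed_real = real_labels * 0.9  # 1 -> .9: D becomes less confident on real data
smoothed_fake = fake_labels        # one-sided: fake targets are left untouched

print(smoothed_real)
print(smoothed_fake)
```

Only the real targets move, so the discriminator is discouraged from pushing D(x) all the way to 1 without rewarding it for scoring fakes above 0.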
3.8.2 Batch normalization
Ioffe & Szegedy (2015) introduced batch normalization, which transforms the mini-batches to have a mean of zero and a standard deviation of one. The transformation occurs after every layer, so that the input data for the subsequent layer is normalized. Intuitively, as the scale of
the data decreases, the scale of the loss function decreases. Therefore, the method reduces the
dependency of the gradients on the scale of the data, allows the network to employ a higher
learning rate thus faster convergence and reduces the need for dropout. LeCun, Bottou, Orr &
Müller (1998) showed that a normalization method greatly increases the speed of training a
neural network. Radford, et al. (2015) showed that the batch normalization prevents the
generator from showing symptoms of mode collapse.
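A minimal sketch of the normalization step, omitting the learned scale and shift parameters of the full batch normalization layer:

```python
import numpy as np

def batch_norm(h, eps=1e-5):
    """Normalize a mini-batch of layer activations to (approximately) zero
    mean and unit standard deviation per feature (Ioffe & Szegedy, 2015).
    The learned gamma/beta parameters are omitted for brevity."""
    mu = h.mean(axis=0)       # per-feature mean over the mini-batch
    sigma = h.std(axis=0)     # per-feature standard deviation
    return (h - mu) / (sigma + eps)

# A mini-batch of 64 activations with an arbitrary mean and scale.
h = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(64, 2))
h_norm = batch_norm(h)
print(h_norm.mean(axis=0))  # approximately [0, 0]
print(h_norm.std(axis=0))   # approximately [1, 1]
```

Whatever scale the previous layer produced, the next layer always receives inputs on the same footing, which is the dependency reduction described above.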
3.8.3 Dropout
To prevent the discriminator from overfitting, the architecture of the GAN employs dropout in the layers. Srivastava, Hinton, Krizhevsky, Sutskever & Salakhutdinov (2014) propose the key idea of dropout: units of a neural network are randomly set to zero during training. This induces stochasticity in the network and prevents the network from overfitting. Intuitively, this is similar to ensemble methods: each time a neuron drops from the architecture, a different network arises. This leads to the creation of an ensemble of sub-networks (Goodfellow, Bengio &
Courville, 2016: page 260). These authors showed that dropout improved the performance of
neural networks dramatically. Nowadays, GANs benefit from dropout in the architecture of the
discriminator (Isola, et al. 2016).
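A minimal sketch of (inverted) dropout; the rate of .5 is the value suggested by Srivastava, et al. (2014) for hidden layers.

```python
import numpy as np

def dropout(h, rate=0.5, rng=None):
    """Inverted dropout: randomly zero out a fraction `rate` of the
    activations and rescale the survivors so the expected activation is
    unchanged (Srivastava, et al. 2014). Applied only during training."""
    rng = np.random.default_rng(0) if rng is None else rng
    mask = rng.random(h.shape) >= rate  # each unit survives with prob. 1 - rate
    return h * mask / (1.0 - rate)

h = np.ones((4, 8))  # a mini-batch of hidden activations, all equal to 1
out = dropout(h, rate=0.5)
print(out)  # roughly half the entries are 0, the surviving ones are 2
```

Each forward pass samples a different mask, so the GAN's discriminator is effectively an ensemble of thinned sub-networks, as described above.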
3.8.4 Wasserstein GAN
The Wasserstein GAN (Arjovsky, Chintala & Bottou, 2017) replaces this loss function by the Earth-Mover distance, which measures the amount of cost, referred to as “dirt”, it takes to transform the initialized distribution (e.g., Gaussian) into the real data distribution. Intuitively, the dirt is measured by multiplying the mass of the distribution by the distance it needs to travel to approximate the real data distribution (Arjovsky, Chintala & Bottou, 2017). Deriving from the formulation of the Earth-Mover distance, the loss functions of the discriminator and generator are (Arjovsky, Chintala & Bottou, 2017: Algorithm 1):
J^(D) = −𝔼_{x∼ℙ_r}[f_w(x)] + 𝔼_{z∼p(z)}[f_w(g(z))]   (9)

J^(G) = −𝔼_{z∼p(z)}[f_w(g(z))]   (10)

In contrast to equation (1), here 𝔼_{x∼ℙ_r}[f_w(x)] is the expected score that the discriminator (D), now equipped with a linear activation function, assigns to samples from the real data distribution, and 𝔼_{z∼p(z)}[f_w(g(z))] is the expected score of the generated samples, indicating to what degree G(z) is considered real. The discriminator trains by minimizing equation (9), which pushes the scores of the real samples up and those of the fake samples down, while the generator trains by minimizing equation (10). Notice the minus signs: rewriting the objectives this way allows us to minimize the losses instead of maximizing them, which allows us to use optimization algorithms such as Adam. Contrary to equation (1), the saturation effect of the sigmoid is avoided and the log function is removed from the loss functions. Due to these features, the authors refer to the discriminator as a critic instead of a detective, as the discriminator no longer determines whether the samples are real or fake, but to what degree the samples are real or fake.
The authors provide proof that when G is continuous, which is by definition the case when G is a neural network, the Wasserstein distance is guaranteed to be continuous and differentiable almost everywhere, in contrast to the Kullback-Leibler divergence. To illustrate this, imagine two uniform distributions whose supports do not overlap. No matter how large the distance between these two distributions, the Kullback-Leibler divergence is infinite and provides no usable gradient, which implies that we are unable to minimize the distance by means of gradient descent. The Wasserstein distance does not have this property, since it is defined by a differentiable, finite Earth-Mover distance (Arjovsky, Chintala & Bottou, 2017: page 4).
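Analogous to the earlier loss sketch, the critic and generator costs of Algorithm 1 in Arjovsky, Chintala & Bottou (2017) can be illustrated on hypothetical critic scores; note that the scores are unbounded, as the critic has a linear output.

```python
import numpy as np

def critic_loss(f_real, f_fake):
    """WGAN critic cost, written for minimization:
    -mean(f_w(x)) + mean(f_w(g(z))). No sigmoid and no log:
    f_w is an unbounded score, not a probability."""
    return -np.mean(f_real) + np.mean(f_fake)

def wgan_generator_loss(f_fake):
    """WGAN generator cost: -mean(f_w(g(z)))."""
    return -np.mean(f_fake)

# Hypothetical critic scores for one mini-batch: real samples score high.
f_real = np.array([3.1, 2.7, 3.4])
f_fake = np.array([-1.2, -0.8, -1.5])

print(critic_loss(f_real, f_fake))   # very negative: the critic separates well
print(wgan_generator_loss(f_fake))   # positive: G should raise its scores
```

Because neither cost saturates, the critic loss keeps carrying information about the gap between real and fake scores, which is the "meaningful loss metric" property noted above.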
A further practical advantage is that the critic D can be trained until convergence while G is still able to catch up. Intuitively, a converged D is assumed
to be optimal thus is able to give the most accurate gradient to G. Whereas, in case of the
Kullback-Leibler divergence, we needed to account for a delicate balance between D and G,
see section 3.7. Consequently, the authors empirically show that the convergence of the GAN
is more stable. During their empirical study, the authors did not find any evidence for mode
collapse during training (Arjovsky, Chintala & Bottou, 2017). Another advantage of the
Wasserstein distance is the ability to provide a meaningful loss metric that correlates with the
quality of the generated images. Nonetheless, at the time of writing this property has not been investigated for one-dimensional data generation, especially not in a marketing context.
4. Research design
4.1 Data description
The real churn data set is provided by an insurance provider in the Netherlands, whereas the market data set is provided by well-known supermarket chains in the Netherlands.
4.1.1 Customer data
The customer data in this study consists of two churn data sets. The first data set is an artificial
churn data set of 3,333 observations, which is freely available on the internet. The data set
contains the variables described in Table 1.
Table 1, artificial churn data set variables.
Variable Scale Description
Account Length Ratio The number of months a customer has a contract.
International Plan Nominal Whether the customer has an international plan.
Voicemail Plan Nominal Whether the customer has a voicemail plan.
Voicemail Message Integer The number of voicemail messages.
Day Min. Continuous The number of minutes called during the daytime.
Day Calls Integer The number of calls during the daytime.
Day Charge Continuous The amount charged during the daytime.
Eve Min. Continuous The number of minutes called during the evening.
Eve Calls Integer The number of calls during the evening.
Eve Charge Continuous The amount charged during the evening.
Night Min. Continuous The number of minutes called during the night.
Night Calls Integer The number of calls during the night.
Int. Min. Continuous The number of international minutes called.
Int. Calls Integer The number of international calls.
Int. Charge Continuous The amount charged for international calls.
Cust. Serv. Calls Integer The number of customer service calls.
Churn Nominal Whether the customer churned.
The second data set is a real churn data set from an anonymous insurance company in the
Netherlands, which consists of 1,262,423 observations. Table 2 describes the variables that are
present in the data set.
Table 2, variables in the real churn data set from an insurance provider in the Netherlands.
Variable Scale Description
Churn Nominal Whether the customer cancelled the contract.
Gender Nominal Male or female.
Age Continuous The age of the customer in years.
Rel. duration Continuous The duration of the relationship.
Collective Nominal Whether a customer is part of an insurance collective.
Size of Policy Categorical The size of the policy.
AV 2011 Categorical Additional insurance package of the customer.
Complaints Categorical The number of complaints.
Contact Integer The number of contacts the customer made.
Distance to store Categorical The distance to a store.
Address size Categorical The size of the house of a customer.
Incoming contacts Nominal Whether the customer contacted the insurance.
# incoming contacts Integer The number of incoming contacts from the customer to the insurer.
AV cancellation Nominal Whether a customer has cancelled the additional insurance.
Defaulter Nominal Whether somebody had trouble paying in the past.
Urbanity Categorical The urbanity of where the customer lives.
Social class Categorical The social class of the customer.
Stage of life Categorical The stage of life of a customer.
Income Categorical The income that a customer receives.
Education Categorical The level of education a customer received.
“BSR.groen” Nominal An additional insurance package.
“BSR.rood” Nominal An additional insurance package.
Without children Nominal Whether a customer has no children.
Payment method Nominal The payment method of a customer.
Declared Nominal Whether a customer has declared any value.
Declaration amount Continuous The amount of declarations in euros.
4.1.2 Market data
The market data set consists of lemonade sales of brands in different supermarket chains from
the Netherlands. This data set consists of 4,858 observations. Additional weather data is
collected from the KNMI and Google Trends. Table 3 presents the variables that are in the data
set.
Table 3, real market data set.
Variable Scale Description
Date Categorical The date at the time of sales.
Year Categorical The year at the time of sales.
Quarter Integer The quarter at the time of sales.
Week Integer The week at the time of sales.
Chain Categorical The supermarket chain.
Brand Categorical The brand of lemonade.
Unit Sales Integer The units of lemonade sold.
Price PU Continuous The price of the lemonade with the promotion.
BasePrice PU Continuous The price of the lemonade without promotion.
FeatDispl Integer % of stores with feature and display promotion.
DispOnly Integer % of stores with display promotion.
FeatOnly Integer % of stores with feature promotion.
Promotion Continuous % of discount.
Revenue Continuous The amount of revenue in euros.
MinTemp Continuous The minimum temperature in Celsius at De Bilt.
MaxTemp Continuous The maximum temperature in Celsius at De Bilt.
Sunshine Continuous The duration of sunshine at De Bilt.
Rain Continuous Duration of rain in .1 hour.
KarvanC. Go Continuous Google Trends index (0-100).
4.1.3 Data cleaning and missing values
Table 4, missing values and imputation methods
Variable Missing (%) Imputation method
Rel. duration 4.49 MI (decision tree)
AV 2011 23.89 MI (decision tree)
Distance to store 1.88 Listwise deletion
Urbanity .07 Listwise deletion
Social class 1.69 Listwise deletion
Income 9.78 MI (decision tree)
Education 1.69 Listwise deletion
“BSR groen” 47.02 MI (decision tree)
“BSR rood” 47.02 MI (decision tree)
Without Children 8.61 MI (decision tree)
Payment method .003 Listwise deletion
Declarations amount <.001 Listwise deletion
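As a minimal sketch of the listwise deletion applied to several variables in Table 4 (the decision-tree based multiple imputation is not sketched here, as it requires a dedicated implementation), assuming a small hypothetical data matrix with missing values:

```python
import numpy as np

# Hypothetical data matrix: two variables, four observations, two of which
# contain a missing value (NaN).
data = np.array([[34.0, 1.2],
                 [np.nan, 0.7],
                 [51.0, np.nan],
                 [45.0, 2.1]])

# Listwise deletion: drop every observation (row) that contains at least one
# missing value, keeping only the fully observed rows.
complete = data[~np.isnan(data).any(axis=1)]
print(complete)
print(len(complete))  # 2 of the 4 observations remain
```

Listwise deletion is only defensible for variables with very small missing fractions, which is why the higher-percentage variables in Table 4 are imputed instead.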