Overcoming privacy concerns with the generation of fake data.
Gilian Ponte
S2591634
Master Thesis Defense
July 4, 2019
Agenda: Introduction - Theory - Methodology - Results (H1, H2, H3, Empirical issues) - Discussion.
Facebook and Cambridge Analytica.
- Data of 87 million users harvested without consent.
- Consequences:
  - A loss of 119 billion dollars in market value (≈ McDonald's total value).
  - Termination of Cambridge Analytica.
  - The FTC investigates Facebook.
(Peltier et al., 2013; Reuters, 2019)
Increasing relevance of privacy.
- Firms annually spend around $36 billion.
- Customers respond negatively to a firm's collection and use of individual-level data.
- General Data Protection Regulation (GDPR).
- Martin, Borah & Palmatier (2017) describe that a data breach leads to:
  - significant negative stock performance;
  - spillover effects;
  - effects that are mitigated by transparency and control.
- Privacy is one of the highest priorities of companies and research.
(Columbus, 2014; van Doorn & Hoekstra, 2013; Kannan & Li, 2017; Marketing Science Institute, 2018)
Generative adversarial networks (GANs).
Generative adversarial networks.
- The discriminator D scores real data x and fake data generated by G(z).
- D and G train jointly in a minimax game:

min_G max_D V(D, G) = E_x~p_data[log D(x)] + E_z~p_z[log(1 − D(G(z)))],

where the first term is the expected score over samples from the real distribution and the second term is the expected score over fake data generated by G(z).
(Goodfellow, 2016)
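The minimax value function above can be illustrated numerically. This is a minimal sketch (not from the thesis): the toy distributions and the two hand-picked discriminators are assumptions, chosen only to show that a discriminator that separates real from fake pushes V(D, G) up, while a blind discriminator yields V = −log 4.

```python
import math
import random

random.seed(0)

def V(D, real_samples, fake_samples):
    """Monte Carlo estimate of the GAN value function
    V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]."""
    real_term = sum(math.log(D(x)) for x in real_samples) / len(real_samples)
    fake_term = sum(math.log(1 - D(x)) for x in fake_samples) / len(fake_samples)
    return real_term + fake_term

# Toy setup: real data ~ N(4, 1), fake data ~ N(0, 1).
real = [random.gauss(4, 1) for _ in range(10_000)]
fake = [random.gauss(0, 1) for _ in range(10_000)]

sharp_D = lambda x: 1 / (1 + math.exp(-4 * (x - 2)))  # separates the two modes
blind_D = lambda x: 0.5                               # cannot tell real from fake

print(V(blind_D, real, fake))  # exactly 2*log(1/2) = -log 4 ≈ -1.386
print(V(sharp_D, real, fake))  # much closer to 0: this D wins the game
```

At the (theoretical) optimum of G, no discriminator can do better than the blind one, which is why −log 4 is the value of the game at convergence.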
Training a GAN.
Noise z ∼ Normal(0, 1) is fed to the generator; the discriminator compares the generated samples G(z) against samples from the real data set.
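The alternating training procedure can be sketched on toy 1-D data. Everything here is an illustrative assumption (distributions, parameterisation, learning rates), not the thesis setup: the generator is a simple shift G(z) = z + θ, the discriminator a logistic function, and both are updated with analytic gradients.

```python
import math
import random

random.seed(0)

def sigmoid(u):
    # Numerically stable logistic function.
    return 1 / (1 + math.exp(-u)) if u >= 0 else math.exp(u) / (1 + math.exp(u))

# Toy GAN: real data ~ N(3, 1); generator G(z) = z + theta with z ~ N(0, 1);
# discriminator D(x) = sigmoid(a*x + b).
theta, a, b = 0.0, 0.0, 0.0
lr, batch, trace = 0.02, 64, []

for step in range(2000):
    real = [random.gauss(3, 1) for _ in range(batch)]
    fake = [random.gauss(0, 1) + theta for _ in range(batch)]

    # Discriminator step: gradient ascent on log D(x) + log(1 - D(G(z))).
    d_real = [sigmoid(a * x + b) for x in real]
    d_fake = [sigmoid(a * g + b) for g in fake]
    grad_a = (sum((1 - d) * x for d, x in zip(d_real, real))
              - sum(d * g for d, g in zip(d_fake, fake))) / batch
    grad_b = (sum(1 - d for d in d_real) - sum(d_fake)) / batch
    a += lr * grad_a
    b += lr * grad_b

    # Generator step: gradient ascent on the non-saturating loss log D(G(z)).
    d_fake = [sigmoid(a * g + b) for g in fake]
    theta += lr * a * sum(1 - d for d in d_fake) / batch
    trace.append(theta)

print(trace[-1])  # theta drifts toward the real mean (3)
```

The shift parameter moves toward the real mean, but, as the non-convergence slide below notes, GAN dynamics of exactly this kind can oscillate around the equilibrium rather than settle on it.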
Non-convergence.
- GANs are a very new method!
- Mode collapse.
- Evaluation of training is difficult.
(Radford et al., 2015; Theis, Oord & Bethge, 2015; Goodfellow, 2016; Salimans et al., 2016; Arjovsky, Chintala & Bottou, 2017)
Wasserstein - Generative adversarial network.
- Wasserstein distance:
  - Defined even when the real and generated distributions barely overlap.
  - Differentiable.
  - Removes the dependency of G on D.
  - Linear activation function in the output layer (the "critic").
- The standard GAN loss, by contrast, is not defined without overlap.
(Arjovsky, Chintala & Bottou, 2017)
(Goodfellow, 2016; Beaulieu-Jones et al., 2018; Kumar, Biswas & Sanyal, 2018)
* Three data sets: artificial churn data set (1), real churn data set (2) and market data set (3).
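A sketch of why the Wasserstein distance stays informative without overlap: for two equal-size 1-D samples it has a closed form, the mean absolute difference between the sorted samples. The toy distributions below are assumptions for illustration, not thesis data.

```python
import random

random.seed(1)

def wasserstein_1d(xs, ys):
    """W1 between two equal-size 1-D empirical distributions:
    the mean absolute difference of the sorted samples."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

real  = [random.gauss(0, 1) for _ in range(5000)]
close = [random.gauss(0.5, 1) for _ in range(5000)]  # overlapping distributions
far   = [random.gauss(10, 1) for _ in range(5000)]   # essentially disjoint support

print(wasserstein_1d(real, close))  # ≈ 0.5: small, informative
print(wasserstein_1d(real, far))    # ≈ 10: large but finite, still a usable signal
```

A Kullback-Leibler-based loss would already be undefined in the disjoint case; the Wasserstein distance degrades gracefully, which is what gives the critic a usable gradient.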
Hypotheses*
- Theoretically, when a GAN successfully converges, the real density surface is approximated.
H1: The correlation matrix from the fake data set significantly correlates with the real data correlation matrix.
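One way to operationalise H1 is to correlate the off-diagonal entries of the two correlation matrices. The sketch below assumes a hypothetical data-generating process (a random but fixed covariance structure); `good_fake` stands in for what a converged GAN should produce, `bad_fake` for independent noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def corr_of_corrs(real, fake):
    """Pearson correlation between the lower-triangle entries of the
    two data sets' correlation matrices (one way to test H1)."""
    idx = np.tril_indices(real.shape[1], k=-1)
    return np.corrcoef(np.corrcoef(real, rowvar=False)[idx],
                       np.corrcoef(fake, rowvar=False)[idx])[0, 1]

# Toy "real" process with a random (but fixed) correlation structure.
A = rng.normal(size=(6, 6))
cov = A @ A.T
real      = rng.multivariate_normal(np.zeros(6), cov, size=20_000)
good_fake = rng.multivariate_normal(np.zeros(6), cov, size=20_000)  # same structure
bad_fake  = rng.normal(size=(20_000, 6))                            # independent noise

print(corr_of_corrs(real, good_fake))  # close to 1: structure preserved
print(corr_of_corrs(real, bad_fake))   # far from 1: structure lost
```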
Correlations - Artificial churn data set.
[Figure: correlation matrices for the real, GAN and WGAN data; r = .99***, r = .70***.]
GAN.
[Figure: generated versus real distributions for the artificial churn data set (i = 50,000), the real churn data set (i = 500,000) and the market data set (i = 300,000).]
Multiple experiments in training the GANs were run to better approximate the real data.
Wasserstein GAN.
[Figure: generated versus real distributions for the artificial churn data set (i = 1,000,000), the real churn data set (i = 1,000,000) and the market data set.]
Multiple experiments in training the WGANs were run to better approximate the real data.
Conditional on successfully generating fake data...
H2: The predictive accuracy of machine learning techniques is significantly lower on fake data than on real data.
H2a: The addition of generated fake data to real data significantly increases the predictive accuracy of machine learning techniques compared to only real data.
H2: Artificial churn data set.
H2: Real churn data set.
H2a: Additional data - Artificial churn data set.
*n = 512,000
H2a: Additional data - Real churn data set.
The normative value of OLS estimations for lemonade sales?
H3: The parameters are equal between the estimation based on generated fake data and an estimation on the real data.
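The lemonade-sales framing of H3 can be sketched as follows. The data-generating process (sales = 10 + 3·temperature + noise) is a hypothetical assumption, and the "fake" data here is an idealised stand-in: a fresh draw from the same process, which is what a perfectly converged GAN would approximate.

```python
import numpy as np

rng = np.random.default_rng(42)

def ols(X, y):
    """OLS coefficients via least squares, intercept prepended."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

# Hypothetical lemonade-sales process: sales = 10 + 3*temp + noise.
n = 50_000
temp_real  = rng.normal(25, 5, n)
sales_real = 10 + 3 * temp_real + rng.normal(0, 2, n)

# Idealised "fake" data: a fresh draw from the same process.
temp_fake  = rng.normal(25, 5, n)
sales_fake = 10 + 3 * temp_fake + rng.normal(0, 2, n)

b_real = ols(temp_real, sales_real)
b_fake = ols(temp_fake, sales_fake)
print(b_real, b_fake)  # both close to [10, 3]
```

Under H3 the two coefficient vectors should agree up to sampling error; how far an actual GAN-generated data set falls short of this ideal is exactly what the results below measure.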
H3: Artificial churn data set.
Significantly different parameters and variances (H = 194, H = 1,449), but they correlate highly (r = .99, r = .90).
H3: Market data set.
Significantly different estimations (H = 10, H = 5,208), but they correlate highly (r = .95, r = .65).
Empirical issues...
How does the generated fake data influence the MAPE (Mean Absolute Percentage Error), compared to the estimation on the real data?
How does the generated fake data influence the RAE (Relative Absolute Error) and the Theil U-statistic, compared to the estimation on the real data?
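The three accuracy measures can be sketched directly. Definitions vary in the literature; here RAE is taken relative to a mean-prediction null model, and the U1 variant of Theil's statistic is assumed. The two prediction vectors are illustrative placeholders, not thesis results.

```python
def mape(actual, pred):
    """Mean Absolute Percentage Error (actuals must be non-zero)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)

def rae(actual, pred):
    """Relative Absolute Error: absolute errors relative to the
    errors of a mean ("null-model") prediction."""
    mean_a = sum(actual) / len(actual)
    num = sum(abs(a - p) for a, p in zip(actual, pred))
    den = sum(abs(a - mean_a) for a in actual)
    return num / den

def theil_u1(actual, pred):
    """Theil's U (U1 form): RMSE scaled by the root mean squares of
    actuals and predictions; 0 is perfect, values near 1 are poor."""
    n = len(actual)
    rmse = (sum((a - p) ** 2 for a, p in zip(actual, pred)) / n) ** 0.5
    denom = ((sum(a * a for a in actual) / n) ** 0.5
             + (sum(p * p for p in pred) / n) ** 0.5)
    return rmse / denom

actual    = [100, 120, 140, 160]
pred_real = [102, 118, 143, 158]  # e.g. a model estimated on real data
pred_fake = [90, 130, 120, 180]   # e.g. a model estimated on fake data

print(mape(actual, pred_real), mape(actual, pred_fake))
```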
* Null-model
Discussion & limitations.
- Very new method.
- Differential privacy.
- Artificial intelligence & marketing.
- Time-series data.
- Increase in predictive accuracy!
- Implications for privacy, data sharing and the development of theory.
- Cambridge Analytica & Facebook?
References
1. Arjovsky, M., Chintala, S., & Bottou, L. 2017. Wasserstein GAN. arXiv. https://arxiv.org/abs/1701.07875.
2. Beaulieu-Jones, B. K., Wu, Z. S., Williams, C., Lee, R., Bhavnani, S. P., et al. 2018. Privacy-preserving generative deep neural networks support clinical data sharing. http://doi.org/10.1101/159756.
3. Columbus, L. 2014, June 27. 2014: The Year Big Data Adoption Goes Mainstream In The Enterprise. Forbes. https://www.forbes.com/sites/louiscolumbus/2014/01/12/2014-the-year-big-data-adoption-goes-mainstream-in-the-enterprise/#10a2418c2055.
4. Doorn, van J., & Hoekstra, J. C. 2013. Customization of Online Advertising: The Role of Intrusiveness. Marketing Letters, 24(4): 339-351.
5. Goodfellow, I. 2016. NIPS 2016 Tutorial: Generative Adversarial Networks. arXiv. https://arxiv.org/abs/1701.00160.
6. Kannan, P. K., & Li, H. A. 2017. Digital marketing: A framework, review and research agenda. International Journal of Research in Marketing, 34(1): 22-45.
7. Kumar, A., Biswas, A., & Sanyal, S. 2018. eCommerceGAN: A Generative Adversarial Network for E-commerce. arXiv. https://arxiv.org/abs/1801.03244.
8. Marketing Science Institute. 2018. Research Priorities 2018-2020. Cambridge, Mass.: Marketing Science Institute.
9. Peltier, J. W., Zahay, D., & Lehmann, D. R. 2013. Organizational Learning and CRM Success: A Model for Linking Organizational Practices, Customer Data Quality, and Performance. Journal of Interactive Marketing, 27(1): 1-13.
10. Radford, A., Metz, L., & Chintala, S. 2015. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv. https://arxiv.org/abs/1511.06434.
11. Reuters. 2019, February 15. Facebook may face multibillion-dollar US fine over privacy lapses – report. The Guardian. https://www.theguardian.com/technology/2019/feb/14/facebook-ftc-privacy-cambridge-analytica-fine.
12. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. 2016. Improved Techniques for Training GANs. In Advances in Neural Information Processing Systems, 2234-2242.
13. Schneider, M. J., Jagpal, S., Gupta, S., Li, S., & Yu, Y. 2017. Protecting customer privacy when marketing with second-party data. International Journal of Research in Marketing, 34(3): 593-603.
14. Theis, L., Oord, A. van den, & Bethge, M. 2015. A note on the evaluation of generative models. arXiv. https://arxiv.org/abs/1511.01844.
Loss function G.
- When the discriminator has high certainty over the fake samples, D(G(z)) will be very small (e.g., .001).
- Assuming that D is very certain (D(G(z)) = .001):
  - log(1 − D(G(z))) = log(.999) ≈ −.001
  - log(D(G(z))) = log(.001) ≈ −6.91
- Maximising log(D(G(z))) instead of minimising log(1 − D(G(z))) allows for a larger gradient and faster learning.
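The arithmetic on this slide can be checked directly; the gradients with respect to D(G(z)) make the difference between the two generator losses even more visible. A minimal sketch:

```python
import math

d_fake = 0.001  # D is very certain the sample is fake: D(G(z)) = .001

# Original minimax loss for G: minimise log(1 - D(G(z))).
saturating = math.log(1 - d_fake)      # log(.999) ≈ -0.001 -> nearly flat
# Non-saturating alternative: maximise log D(G(z)).
non_saturating = math.log(d_fake)      # log(.001) ≈ -6.91  -> far from flat

# Gradients with respect to D(G(z)):
grad_saturating = -1 / (1 - d_fake)    # ≈ -1.0: tiny learning signal
grad_non_saturating = 1 / d_fake       # ≈ 1000: strong learning signal

print(saturating, non_saturating, grad_saturating, grad_non_saturating)
```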
- Similar to how we humans learn: from the most difficult examples (Goodfellow et al., 2014).
- More variability in the examples given to the network (Leeflang et al., 2015).
- Better able to explain variability in the dependent variable.
- Better generalization to new data through training on more, different samples.
- Risselada et al. (2010): accuracy is highly related to the data set.
Dependency of G on D.
[Figure: real data; D captures the real distribution; G captures the distribution from D.]
Difference between the Kullback-Leibler and Wasserstein distance.
- Definitions:
  - Kullback-Leibler divergence = infinity (without overlap).
  - The loss function is then not defined!
  - Wasserstein distance ≠ infinity.
- Implications:
  - It is possible to take the gradient.
  - No dependence of G on D.
  - It does not matter how strong D is.
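The contrast can be made concrete with two point-mass distributions whose supports do not overlap: the KL divergence is infinite (no gradient to follow), while the Wasserstein distance is the finite transport cost. The grid and the two distributions below are illustrative assumptions.

```python
import math

# Two point-mass distributions on a grid that do not overlap.
grid = [0, 1, 2, 3]
p = [1.0, 0.0, 0.0, 0.0]  # real distribution: all mass at x = 0
q = [0.0, 0.0, 0.0, 1.0]  # generated distribution: all mass at x = 3

def kl(p, q):
    """KL(p || q); returns inf as soon as q(x) = 0 where p(x) > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def wasserstein(grid, p, q):
    """W1 on the real line via the CDF formula: sum |F_p - F_q| * dx."""
    total, cp, cq = 0.0, 0.0, 0.0
    for i in range(len(grid) - 1):
        cp += p[i]
        cq += q[i]
        total += abs(cp - cq) * (grid[i + 1] - grid[i])
    return total

print(kl(p, q))                 # inf -> no usable gradient for G
print(wasserstein(grid, p, q))  # 3.0 -> distance still informative
```

As the generated mass moves toward the real mass, the Wasserstein distance shrinks smoothly, whereas the KL divergence stays infinite until the supports actually overlap.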
Some first results from a medical trial.
(Beaulieu-Jones et al., 2018).
Correlations - Real churn data set.
[Figure: correlation matrices; r = .89***, r = .43***.]
Correlations - Market data set.
[Figure: correlation matrices; r = .93***, r = .62***.]