
The untraceable customer: A comparison study of data anonymization techniques

Master thesis: MSc Marketing (Intelligence Track)
University of Groningen, Faculty of Economics and Business

January 10, 2021

JOOST VAN GREUNSVEN
Student number: 2566508
E-mail: J.van.Greunsven@student.rug.nl

Supervisor (first): Prof. dr. J.E. (Jaap) Wieringa

Supervisor (second): dr. A.E. Vomberg

Acknowledgements:


Abstract

This research studies the effect of the anonymization of customer data on the utility of that data. In 2018, the European Union (EU) introduced the General Data Protection Regulation (GDPR), with the goal of protecting customer privacy and giving customers more control over their personal information. Under the GDPR, firms are obliged to limit their data collection and store only data serving the purposes they have communicated to customers. However, data that is processed anonymously falls outside these restrictions, so firms are allowed to store such information. Besides the introduction of the GDPR, public awareness of the use of personal data by firms has increased in recent years. Whether firms like Facebook, Amazon and Google can be trusted with customer data became a relevant concern.

Firms can anonymize data by applying one or more data anonymization techniques. Our study focuses on four commonly used non-model-based techniques: Generalization, Suppression, Cell swapping and Adding noise. We use these techniques to compose datasets with several levels of privacy. To measure the level of privacy, we adopt the principle of k-anonymity. To measure the utility of a dataset, we estimate for each dataset a binary logistic regression model of customer churn behavior. We expect that an increase in privacy leads to a decrease in the utility of data (Wieringa et al., 2019). This procedure allows us to compare every data anonymization technique at the same levels of k-anonymity.

The results of our study show that the anonymization of data has little effect on predictive performance. Contrary to previous research, the results therefore do not support the expected trade-off between an increase in privacy and a decrease in predictive performance, which is surprising. Although the results are unexpected, they offer potential for firms and society in general: firms are able to comply with the GDPR without losing substantial predictive performance, and the anonymization of data could help to regain consumers' trust in how firms use their personal information.

Finally, we provide an overview of the advantages and disadvantages of every data anonymization technique. The overview could help managers to choose the best technique to implement in their organization.


Table of contents

1. Introduction

2. Theoretical framework

2.1 Privacy of customer data

2.2 Data minimization and anonymization

1. Introduction

Correctly applied marketing analytics can play a vital role in running a business. Germann et al. (2013) find that a stronger focus on marketing analytics increases firms' return on assets by up to eight percent per annum. In addition, a study by Martin et al. (2017) reveals that the usage of customer data can improve productivity and profitability by five to six percent per annum compared to competitors who do not use customer data for marketing analytics purposes. Firms provide personalized recommendations and offers, free services and more relevant marketing communications to customers in order to achieve a higher level of profitability and productivity (Martin & Murphy, 2017). These increases in return on assets, productivity and profitability are essential to survive in a competitive market. Therefore, marketing analytics are key for firms (Wedel & Kannan, 2016). Providing these customized extra services requires personal information, which comes at the expense of customer privacy. When Edward Snowden revealed documents regarding various global surveillance programs, the privacy of individual customers became a high-profile topic. Whether firms like Facebook, Amazon and Google can be trusted with customer data became a relevant concern (Verhoef et al., 2016). Public awareness increased, and the use of customer data now poses potential threats for firms. More specifically, the use of individual customer data could lead to a loss of consumer trust (Bart et al., 2005; Deighton, 2005) and negative stock valuations (Acquisti et al., 2006). In other words, the misuse of customer data can backfire on firm performance.

Next to the increased public awareness regarding customer privacy, governments also took an interest in the use of customer data. The European Union spent many years composing the General Data Protection Regulation (GDPR) and introduced the legislation in 2018 (European Parliament, 2013). The goal of the GDPR is to protect customer privacy and give customers more control over their personal information. To this end, the European Union includes specific legislation in the GDPR regarding the collection and use of customer data. Furthermore, the GDPR incorporates legislation that regulates the storage of individual customer data.

As a preserving measure for firms with respect to customer data, Wedel & Kannan (2016) formulate the concept of data anonymization. The process of data anonymization consists of assuring that a specific customer can no longer be distinguished in the full dataset. More specifically, the purpose is to make re-identification of a single individual impossible. Weber (2015) expects that in the coming years a significant portion of event-based data will be anonymized. Therefore, the effect of anonymization on the utility of data is a relevant question for managers as well as academics. This thesis dives into the relationship between anonymous data and the utility of that data. To do so, we compare various data anonymization techniques in relation to predictive performance. Consequently, we study the following research question:

RQ: How do techniques of data anonymization influence the utility of data?

The theoretical relevance of this research lies in its addition to the existing literature on the anonymization of data. As mentioned, Weber (2015) states that the amount of data that must be anonymized will increase in the coming years, driven by social pressure and stronger government regulations such as the GDPR. As this amount grows, the effect of anonymization on the utility of data becomes a more relevant question for managers as well as academics. Currently, various techniques exist to anonymize data. In our research, we focus on commonly used non-model-based techniques: Suppression, Generalization, Cell swapping and Adding noise (Ghinita et al., 2007; Lasko & Vinterbo, 2010; Ohno-Machado et al., 2001; Dwork, 2008). No previous study has compared these commonly used methods. We compare these methods with each other and explore how data utility changes depending on the anonymization technique.

The practical relevance lies, similar to the theoretical relevance, in the expected increase in data anonymity in the coming years (Weber, 2015). The introduction of the GDPR partially drives this increase. Since the introduction of the GDPR, a data breach can be punished with a fine of up to 4% of the yearly worldwide turnover (European Parliament, 2013). As breaking the rules of the GDPR could lead to serious penalties, the GDPR forces firms to implement data minimization and/or data anonymization. Therefore, managers want to know which techniques of data anonymization are the most effective and efficient for their company. We contribute by comparing several commonly used data anonymization methods. Ultimately, we offer managers an overview of the advantages and disadvantages of every method.

2. Theoretical framework

In this second chapter we discuss the theoretical background of this thesis. In the first section, we define the construct of privacy and discuss the relation between the storage and security of customer information. The first section ends with two preserving measures for storing data securely: data minimization and data anonymization. We elaborate on data anonymization in the second section, where we also discuss several techniques of data anonymization that we use in the methodology. Next, in the third section we discuss the relationship between data anonymization and data utility. Finally, we draw a conceptual model to clarify the theoretical framework.

2.1 Privacy of customer data

The first definition of privacy appeared in 1890, together with the rise of photography and newspapers. Back then, Warren and Brandeis (1890) defined the concept of privacy as “the right to be let alone.” The definition was meant to prevent intrusion into an individual's personal environment, such as their house. In addition, Warren and Brandeis stated that individuals should be protected against undesirable publications, such as in a newspaper.

Currently, the definition of privacy is much broader. The shift from protecting physical privacy to protecting informational privacy calls for different definitions of privacy (Goodwin, 1991; Mason, 2017; Rust et al., 2002). Where physical privacy concerns physical intrusion into someone's property, informational privacy concerns the use of customer data, and more specifically personal information, by firms. In the GDPR, the European Union defines personal information as “any information relating to an identified or identifiable natural person.” Firms use personal information (customer data) on two levels: the aggregated level and the individual level.

On the aggregated level, firms use customer data to increase their understanding of customers in general. These insights are on a generic level, and customers are therefore willing to accept the use of customer data for these purposes, as long as it is beneficial to them. An example is the use of RFID tags in retail shops. Retailers can apply RFID tags to products to gather information quickly, for example to reduce the number of empty shelves or to facilitate a quick payment for customers (Smith et al., 2014).

On the individual level, firms use customer data for personalization (… 2005; Montgomery & Smith, 2009). The intention of personalization is to increase the relevance of a firm's products and services (Beke et al., 2018).

As the perception of privacy has shifted to a completely new meaning relative to the 1890 definition of Warren and Brandeis, academics have composed enhanced definitions of the concept of privacy. Table 1 provides an overview of definitions of the privacy construct. The common denominator between the definitions in Table 1 is the control over information by an individual. Altman (1975) states a definition quite similar to that of Clark & Westin (1968), but Altman does not mention consumer information explicitly. The definitions of Holvast (2007) and Beke et al. (2018) are more specific than those of Clark & Westin (1968) and Altman (1975). Specifically, both describe extensively in what way customers, groups or institutions need to have control over their personal information.

The focus of this thesis lies on the use of customer data. Therefore, the definition by Beke et al. (2018) is the most appropriate definition. Consequently, we adopt this definition for the construct of privacy.

According to this conceptualization of privacy, customers have control over their personal information. The level of privacy depends on the extent to which consumers are aware of and have the ability to control the collection and use of personal information by a firm (Beke et al., 2018). For consumers to exercise this control, firms must store personal information in a safe way. Customers want to be certain that unknown outsiders do not have access to their personal data. Storage and security are therefore closely related.

Nowadays, security breaches happen on a regular basis. In the United States, over a thousand security breaches occurred in 2016, an increase of 40% compared to the year before (Bloomberg, 2016). Data breaches negatively affect stock prices, and this effect becomes stronger as the number of victims or the amount of leaked data increases (Acquisti et al., 2006; Martin & Murphy, 2017). The task for firms is to find safe options to store their customer data and decrease the impact of a data breach.

Table 1: Definitions of the privacy construct

Definition                                                          Source
A state of limited access to a consumer's information               Clark & Westin (1968)
The selective control of access to the self                         Altman (1975)
The claim of individuals, groups, or institutions to determine
for themselves when, how, and to what extent information
about them is communicated to others                                Holvast (2007)
The extent to which a consumer is aware of and has the ability
to control the collection, storage, and use of personal
information by a firm                                               Beke et al. (2018)

As we describe in chapter one, customer data is key to survive in a competitive market (Martin & Murphy, 2017; Wedel & Kannan, 2016). To survive in a competitive market, marketers need to perform marketing analytics, for which they use customer data as input. Since customer data is that important to firms, it is crucial that customers trust firms with their personal information. The increase in the number of data breaches and the disclosures by Edward Snowden are events that raise the concern whether customers can trust firms with the use of personal information (Bloomberg, 2016; Deighton, 2005; Verhoef et al., 2016). To regain, or even increase, the trust of customers, firms need to consider measures to preserve customer data.

As preserving measures for the safe storage of customer data, Wedel & Kannan (2016) formulate data anonymization and data minimization. Both measures are part of the GDPR and thus legally binding (European Parliament, 2013). Although we concentrate on data anonymization, as this is the focus of this thesis, section two of this chapter explains data minimization as well; the two constructs are so closely related that we pay attention to both.

2.2 Data minimization and anonymization

The GDPR requires firms to minimize their data storage to only the adequate and relevant information necessary to fulfill the purpose they communicate to the consumer (European Parliament, 2013). Since the 25th of May 2018, the legislation of the GDPR applies directly to the processing of all personal data linked to the European Union's territory. From this day on, a breach of information is punishable by a sanction of up to four percent of the yearly worldwide turnover in the case of an enterprise, or up to a hundred million in all cases (Albrecht, 2016). Consequently, the GDPR forces firms to implement data minimization in order to avoid these sanctions.

To fulfill the requirements of data minimization, marketers need to limit the type and volume of data they collect. Moreover, marketers need to dispose of data they no longer need. As a result, firms are allowed to store only the data of which they communicate the purpose to consumers.

According to the GDPR (European Parliament, 2013), the principles of data protection should only apply to any information concerning an identified or identifiable natural person. Therefore, if a natural person is not identified and cannot be identified, the principles of data protection do not apply. A natural person is not identifiable, according to the GDPR, if information does not relate to the owner of the information. In this case, re-identification is impossible.

To make re-identification of an individual impossible, data anonymization techniques offer solutions. The GDPR does not cover anonymous data, hence the processing of such information, including for statistical or research information, is allowed.

Complying with the GDPR could, at least partially, help to increase the trust of consumers in digital technology (Albrecht, 2016). Therefore, even though the GDPR forces firms to implement data minimization and anonymization, it has benefits.

Capgemini reports that 39% of customers are willing to spend more money if they are certain an organization protects their personal data (Capgemini Research Institute, 2018). In other words, increasing customer trust regarding privacy and security could potentially lead to more sales and ultimately to a competitive advantage (Conroy et al., 2014).

Consumers trust firms with the storage of their personal information. Firms therefore need to act appropriately to prevent data breaches from happening. Data anonymization is a potential measure to improve the security of data storage. Relevant principles concerning data anonymization are the principles of differential privacy and k-anonymity. In order to understand the method of this research, we discuss differential privacy and k-anonymity.

Differential privacy

Dwork et al. (2014) define differential privacy as the paradox of learning nothing about an individual, while learning beneficial information about a population. Under differential privacy, the removal or addition of a single item in a database does not influence the outcome of an analysis (Dwork, 2008). Differential privacy is not an algorithm, but a definition. As an illustration, we provide an example from the paper of Dwork et al. (2014).

Analyzing a medical database potentially teaches us that smoking causes cancer. Insurance companies may change their view on the long-term medical costs of a smoker based on this information. An interesting question regarding differential privacy is whether joining the medical database harms the smoker, and if so, in what way.

As a result of an analysis of its medical database, an insurance company might increase the premiums of smokers, and thus the insurance premium of the individual smoker increases. On the other hand, the insurance company might help the smoker by offering the option to enter a program to stop smoking. So, certainly the insurance company knows more about the individual smoker than before, but does entering the medical database harm the privacy of the individual smoker? Differential privacy takes the perspective that it does not, with the rationale that the impact on the smoker is the same whether or not he is present in the database. It is the conclusion of the analysis that affects the smoker, not his presence or absence in the dataset.
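To make this definition concrete, the following minimal sketch (ours, not from the thesis) applies the Laplace mechanism described by Dwork (2008) to a simple count query; the data and the choice of ε are hypothetical.

```python
import numpy as np

def private_count(values, predicate, epsilon=0.1, rng=np.random.default_rng(0)):
    """Answer a count query with Laplace noise calibrated to sensitivity 1.

    Adding or removing one individual changes a count by at most 1,
    so noise drawn from Laplace(1/epsilon) yields epsilon-differential privacy.
    """
    true_count = sum(predicate(v) for v in values)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical example: how many patients in the database smoke?
smokes = [True, False, True, True, False, False, True, False]
print(private_count(smokes, lambda s: s))  # noisy answer: 4 plus Laplace noise
```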

K-anonymity

A dataset satisfies k-anonymity if each individual in it is indistinguishable from at least k − 1 other individuals. A k-anonymity level of four, for example, implies that each individual is indistinguishable from three other individuals (4 − 1). So, the larger the value of k, the greater the implied privacy, since only a group can be distinguished and thus no single individual. Although k-anonymity provides some basic privacy safeguards, the release of complementary datasets leaves data vulnerable to homogeneity attacks (Machanavajjhala et al., 2007), background knowledge, or insertion attacks (Francis et al., 2017). Hence, k-anonymity is not “the” solution to privacy concerns, but it is still a convenient protection measure which firms often use.

The level of k is based on the quasi-identifiers present in the dataset. LeFevre et al. (2006) define a quasi-identifier as “a minimal set of attributes that can be joined with external information to re-identify individual records”. Examples of quasi-identifiers are gender, race, age, and ZIP-code. As the number of quasi-identifiers in a dataset increases, it becomes more difficult to assure anonymity for each individual, because the number of combinations that can be created with these quasi-identifiers increases. To fulfill the criterion of k-anonymity, every occurring combination of quasi-identifier values needs to match at least k individuals.

To illustrate a problem of k-anonymity, referred to as a joining attack, we use an example from the article of Samarati (2001). Figure 1 presents two tables containing customer data. The top table contains medical data, released as anonymous data. The bottom table represents external data: a voter list.

The quasi-identifiers present in this dataset are Date of birth, Sex, and ZIP-code. For a data recipient it is no longer difficult to combine de-identified microdata with publicly available data, as Figure 1 illustrates. In Figure 1, the privacy of customers is secured by suppressing the Social Security Number (SSN) and Name of every observation. However, when analyzing the data, one can observe that there is only one female who is born on 09/15/61 with the ZIP-code 94142 (see the black dot). This data string can be linked to the voter list: the combination of Sex female, Date of birth 09/15/61 and ZIP-code 94142 can be linked to the name Sue J. Carlson, revealing that Sue J. Carlson has shortness of breath. In this example, the link reveals personal information of a single person; in other cases it might link data to a restricted set of individuals.

Medical Data Released as Anonymous

SSN  Name  Race    Date of Birth  Sex     ZIP    Marital Status  Health Problem
           asian   09/27/64       female  94139  divorced        hypertension
           asian   09/30/64       female  94139  divorced        obesity
           asian   04/18/64       male    94139  married         chest pain
           asian   04/15/64       male    94139  married         obesity
           black   03/13/63       male    94138  married         hypertension
           black   03/18/63       male    94138  married         shortness of breath
           black   09/13/64       female  94141  married         shortness of breath
           black   09/14/64       female  94141  married         obesity
           white   05/14/61       male    94138  single          chest pain
           white   05/15/61       male    94138  single          obesity
         • white   09/15/61       female  94142  widow           shortness of breath

Voter List

  Name             Address         City           ZIP    DOB       Sex     Party
  .....            .....           .....          .....  .....     .....   .....
• Sue J. Carlson   900 Market St.  San Francisco  94142  09/15/61  female  democrat
  .....            .....           .....          .....  .....     .....   .....

Figure 1: Example to illustrate a joining attack
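The joining attack itself requires no more than a standard database join. The sketch below (our illustration, with hypothetical column names) reproduces the linkage of Figure 1 in a few lines:

```python
import pandas as pd

# "Anonymous" medical release: direct identifiers removed,
# but the quasi-identifiers DOB, Sex and ZIP are still present.
medical = pd.DataFrame({
    "dob": ["09/15/61", "03/13/63"],
    "sex": ["female", "male"],
    "zip": ["94142", "94138"],
    "health_problem": ["shortness of breath", "hypertension"],
})

# Public voter list containing names next to the same quasi-identifiers.
voters = pd.DataFrame({
    "name": ["Sue J. Carlson"],
    "dob": ["09/15/61"],
    "sex": ["female"],
    "zip": ["94142"],
})

# Joining on the quasi-identifiers re-attaches the name to the record.
linked = medical.merge(voters, on=["dob", "sex", "zip"])
print(linked[["name", "health_problem"]])  # Sue J. Carlson | shortness of breath
```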

Generalization

The process of generalization works as follows: if the level of k-anonymity does not match the desired level, because individuals are still distinguishable, one replaces precise item descriptions with more general ones. For example, consider the quasi-identifier Date of birth. If a date is 22-10-1993 (dd-mm-yyyy) and one wants to generalize it, a potential manner is to present only the year in which individuals are born instead of the full date-month-year format.

Another example is ZIP-code. Presenting the full ZIP-code is very detailed, and therefore the chance of distinguishing individuals is relatively high. Similar to the example above, one could present ZIP-codes in a more general format, for example by dropping the rightmost digit at each generalization step (Samarati, 2001).
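As an illustration, a minimal sketch of these two generalization steps; the helper functions are our own and not part of any anonymization library:

```python
from datetime import date

def generalize_dob(dob: date) -> int:
    """Generalize a full date of birth to the year of birth only."""
    return dob.year

def generalize_zip(zip_code: str, steps: int = 1) -> str:
    """Drop the rightmost digit(s), replacing them with '*' (Samarati, 2001)."""
    return zip_code[: len(zip_code) - steps] + "*" * steps

print(generalize_dob(date(1993, 10, 22)))   # 1993
print(generalize_zip("94142", steps=1))     # 9414*
print(generalize_zip("94142", steps=2))     # 941**
```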

Generalization preserves the truthfulness of the released data. Other techniques, such as cell swapping, also preserve statistical properties, but compromise on the correctness of the single pieces of information, the tuples.

Another characteristic of generalization is that it allows the release of all single tuples of a dataset, albeit in a more generalized format. One considers this possibility to disclose all tuples an advantage of generalization (Samarati, 2001).

Suppression

Samarati (2001) explains suppression as the removal of single records of data in order to improve the level of privacy. One performs suppression at the level of the tuple, which means that a single record is deleted. Often, suppression and generalization are complementary, and firms therefore use both methods together. Roughly, suppression complements generalization in that suppressing a few outlying records reduces the number of generalization steps required. To clarify how these techniques complement each other, we provide an example from the paper of Samarati (2001).

Table 2 and Table 3 represent the tables Samarati (2001) uses in her paper. In her example, the goal is to achieve a level of k-anonymity equal to k = 2. As a starting point, we present the original table (OT) in Table 2. The four tuples that appear only once in OT (black 94138, black 94141, black 94142, and white 94138) are the tuples we need to suppress to achieve the level k = 2. If we solely perform suppression on OT, we thus suppress four tuples to achieve the desired level of k-anonymity. As a result, we suppress a significant amount of data, as displayed in table GT[0,0] in Table 3. Consequently, the suppressed data can no longer be used when performing marketing analytics, and a potential loss of insights occurs.

A solution to prevent the loss of a significant amount of data is to use both suppression and generalization. All the generalized tables (GT) in Table 2 show ways of generalizing OT. The numbers between the brackets represent the level of generalization performed on each variable.

Table 2: Example of complementary generalization and suppression, with the original table (OT) and its generalized tables (GT)

OT            GT[1,0]        GT[0,1]       GT[0,2]       GT[1,1]        GT[1,2]
Race: R0      Race: R1       Race: R0      Race: R0      Race: R1       Race: R1
ZIP: Z0       ZIP: Z0        ZIP: Z1       ZIP: Z2       ZIP: Z1        ZIP: Z2
asian 94138   person 94138   asian 9413*   asian 941**   person 9413*   person 941**
asian 94138   person 94138   asian 9413*   asian 941**   person 9413*   person 941**
asian 94142   person 94142   asian 9414*   asian 941**   person 9414*   person 941**
asian 94142   person 94142   asian 9414*   asian 941**   person 9414*   person 941**
black 94138   person 94138   black 9413*   black 941**   person 9413*   person 941**
black 94141   person 94141   black 9414*   black 941**   person 9414*   person 941**
black 94142   person 94142   black 9414*   black 941**   person 9414*   person 941**
white 94138   person 94138   white 9413*   white 941**   person 9413*   person 941**

Table 3: Example of complementary generalization and suppression, containing the generalized tables (GT), with suppression, for the original table (OT) of Table 2

GT[0,0]       GT[1,0]        GT[0,1]       GT[0,2]       GT[1,1]        GT[1,2]
asian 94138   person 94138   asian 9413*   asian 941**   person 9413*   person 941**
asian 94138   person 94138   asian 9413*   asian 941**   person 9413*   person 941**
asian 94142   person 94142   asian 9414*   asian 941**   person 9414*   person 941**
asian 94142   person 94142   asian 9414*   asian 941**   person 9414*   person 941**
-             person 94138   -             black 941**   person 9413*   person 941**
-             -              black 9414*   black 941**   person 9414*   person 941**
-             person 94142   black 9414*   black 941**   person 9414*   person 941**
-             person 94138   -             -             person 9413*   person 941**

(- indicates a suppressed tuple)

Cell swapping

To clarify the concept of cell swapping, we discuss an example from Fienberg & McIntyre (2004). Tables 4a and 4b show (a) the original data and (b) the swapped data. Table 4a contains data on three variables (X, Y and Z) for seven individuals. Suppose the data of variable X is sensitive and therefore cannot be released. When observing the table, one can detect that record five is the only unique record, with an X value of 1, a Y value of 1 and a Z value of 1. It is unique because it is individually distinguishable in the table due to its unique combination of variables. Table 4b has the exact same distribution of variables and individuals, but with swapped data: the X values of records one and five are swapped, and the X values of records four and seven are swapped.

Table 4a and 4b: Example cell swapping containing original data (a) and swapped data (b)

(a) Original data (b) Swapped data

When Dalenius & Reiss (1982) introduce cell swapping, they propose it as a confidentiality-preserving method for datasets that contain categorical data. Cell swapping is a technique that perturbs data in order to protect customer privacy (Agrawal & Srikant, 2000). According to Fienberg & McIntyre (2004), the basic idea behind cell swapping is to exchange values of sensitive variables among individual records, in such a way that the lower-order frequency counts, or marginals, are maintained. Exchanging records introduces uncertainty regarding sensitive data, but statistical inferences remain valid due to the preservation of the summary statistics of the data.

In a simple example, as displayed in Tables 4a and 4b, one could implement cell swapping by trial and error. One needs to identify the correct swaps, and in a simple environment with a small dataset this is a relatively simple task. In a large dataset, however, cell swapping becomes more difficult. In the literature, this difficult implementation of cell swapping on large datasets is considered a disadvantage (Samarati, 2001).

Next to the relatively hard implementation on large datasets, Samarati (2001) characterizes cell swapping as a method that harms the “truthfulness” of the data. On a record level, data after swapping is not as pure as the original data. Tables 4a and 4b illustrate this statement: not every record has the same combination of X, Y and Z in the swapped data as in the original data.

Adding noise

The last technique we discuss in this thesis is the technique of adding noise. In the literature, adding noise is referred to as the random value perturbation-based approach. According to Kargupta et al. (2003), the technique of adding noise attempts to preserve data privacy by adjusting original values with a randomized process. This process of adjusting values of sensitive data is seen as adding random noise. The goal of adding noise is to protect customer privacy while retaining the underlying patterns of the original data.

Agrawal & Srikant (2000) propose two approaches: Value-Class Membership and Value Distortion. In the Value-Class Membership method, one divides the values of an attribute into a set of disjoint, mutually exclusive classes. We consider the special case in which values for an attribute are rearranged into interval classes, with the condition that the intervals are of equal width. Consider an example containing the rearrangement of salaries. One uses 10k intervals if mainly “low” values are present and 50k intervals when a dataset contains mostly “high” values. Consequently, instead of a true attribute value, the data displays the interval in which the true value lies. As the data no longer displays true values, but only the intervals of the classes, noise occurs which affects the relationship between the predictor variable and the dependent variable.

When performing the Value Distortion method, the data owner returns a value of u_i + v, where u_i is the original value and v is a random value drawn from a distribution. Typically, one uses the uniform distribution over an interval [−α, α] or a Gaussian distribution with mean µ = 0 and standard deviation σ (Kargupta et al., 2003).

In the process of adding noise, we draw n independent samples v_1, v_2, ..., v_n from a distribution V. As these independent samples are known, we calculate the perturbed values u_1 + v_1, u_2 + v_2, ..., u_n + v_n (Kargupta et al., 2003).

As Agrawal & Srikant (2000) and Kargupta et al. (2003) prefer Value Distortion, we choose Value Distortion as our method to add noise. Even though marketers often use the technique of adding noise in practice, Kargupta et al. (2003) claim that in many cases the original data can be recovered from the perturbed data. In such cases, a spectral filter that exploits theoretical properties of random matrices can be used to filter the noise out of the perturbed data.

2.3 Data utility

The aforementioned anonymization techniques come at the price of a decrease in data utility. For example, generalization and suppression cause a certain level of information loss due to deleting data or replacing data with more general values. Moreover, the addition of random noise leads to measurement errors. Subsequently, the commercial value of the data deteriorates (Duncan et al., 2001; Rust et al., 2002).

In order to assure a certain level of privacy of customer data, firms likely have to use techniques of data anonymization. Therefore, a trade-off between the level of protection and the utility of data exists. This trade-off is highly affected by the type of data, more specifically by whether data is low-dimensional or high-dimensional (Wieringa et al., 2019). Figure 2 illustrates this downward relationship between privacy protection and data utility.

Panel a of Figure 2 displays the low-dimensional case, where a person is identified by just a few attributes, namely: (i) name, (ii) date of birth, (iii) residence, (iv) degree, (v) profession and (vi) income. Suppressing one or two attributes significantly improves the level of privacy, while the information loss is only marginal, which means the data is still useful for analytics purposes. The relationship between privacy and utility changes completely in the case of high-dimensional data. Panel b of Figure 2 shows such a case and contains an image of Barack Obama. The image consists of a high-dimensional arrangement of pixels, which together form the image. It is the specific structure of pixels that gives meaning to the image and makes the person identifiable. To make re-identification impossible, the image would need to be generalized to a level at which it becomes unworkable.

In this study we focus on low-dimensional data and the aforementioned techniques to anonymize it. A negative effect of data anonymization on data utility exists, as illustrated in Figure 2. However, the exact effect of the level of data anonymization on the utility of data depends on the chosen anonymization technique (Schneider et al., 2017). This thesis attempts to generate more insights into the effects of these methods on the utility of data.

Figure 2: Trade-off between data utility and protection level for two cases: (a) the low-dimensional case and (b) the high-dimensional case (Source: Wieringa et al., 2019)

Conceptual model

Although no hypotheses are drawn, we draw a conceptual model to conceptualize the theory. For the methodology section, it is vital to have an overview of the techniques of data anonymization we perform in this research. The conceptual model in Figure 3 provides this overview.

3. Research design

In this chapter, we elaborate on the data and on the development of the datasets and models we compose and estimate. The model development section discusses why we use a churn model and which variables we select. Next, we discuss the type of analysis we use to measure the utility of data. Finally, in section four, we discuss the assumptions underlying the analysis.

3.1 Data description

To answer our research question, we use a dataset of a Dutch health insurance company containing churn data for the year 2011. The dataset contains behavioral information regarding the relationship between the health insurance company and individual customers; in addition, demographic statistics are available. We use data from an entire year since customers in the Dutch health insurance sector typically switch at most once a year (Dijksterhuis & Velders, 2009; Donkers et al., 2007). Churners are the customers who have an insurance policy at the beginning of the year, but no longer have one at the end of the year.
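A sketch of how such a churn flag can be derived from yearly snapshots; all column names are hypothetical, as the actual variable names of the insurer's dataset are not disclosed:

```python
import pandas as pd

# Hypothetical yearly snapshot: one row per customer with insurance
# status at the start and at the end of 2011.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "insured_jan": [True, True, True, True],
    "insured_dec": [True, False, True, False],
})

# Churners hold a policy at the beginning of the year but not at the end.
customers["churn"] = customers["insured_jan"] & ~customers["insured_dec"]
print(customers["churn"].mean())  # churn rate, here 0.5
```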

3.2 Model development

The aim of this thesis is to study the differences in data utility that result from performing several data anonymization techniques. The data anonymization techniques we use have been explained in section 2.2. How we measure data utility, or in other words the predictive quality, has not been discussed yet.

In order to measure the utility of an anonymized dataset, we have to select a theoretical model. According to Holtrop et al. (2017), churn is a major concern for firms. In addition to the explanation of churn in the data description section, Hadiji et al. (2014) studied the behavior of gamers and defined a “churner” as a player who leaves a service. In the mobile telecom industry, churn refers to a customer who switches from one provider to another during a certain period of time (Lu et al., 2012).

Furthermore, churn ratio is important to explain. A churn ratio is the ratio of churners over non-churners as a function of time (Hadiji et al., 2014). In order to compare firms with each other, or for firms to compare different periods with each other, churn ratios are a convenient measure.

Improving customer retention has a strong positive effect on a firm's financials. Therefore, focusing on customer retention instead of, for example, lowering acquisition costs or increasing profit margins is more beneficial to firms (Gupta et al., 2004). Top executives recognize these benefits, and according to Forbes (2011), customer retention is a main priority of top executives in terms of marketing spending. As churn is an important concern in business, and academics also use churn models to measure the utility of data regarding data anonymization, we adopt a churn model in this thesis.

Our churn model consists of the following selection of quasi-identifiers: Age, Gender, and Household size. We consider this small number of quasi-identifiers because Sweeney (2000) was able to uniquely identify 87% of the US population based on Gender, ZIP-code and the full Date of birth. In 2006, Golle revisited the paper of Sweeney with more recent data and was able to uniquely identify 63% of the US population based on the same selection of quasi-identifiers. For our paper, the takeaway of both studies is that it is possible to identify a unique individual based on a small selection of quasi-identifiers. Furthermore, we take into account that the broader the selection of quasi-identifiers, the easier it is to re-identify an individual, or, the other way around, the more difficult it is to assure anonymity. Because of these two arguments we consider a small selection of quasi-identifiers:

Age - The age is the exact age of an individual at the time the dataset was extracted. Therefore, age is measured as a continuous variable.

Gender - The variable gender refers to the gender of the customer and is classified as a binary variable. In this dataset there are two options: (0) man, or (1) woman.

Household size - The household size indicates the number of people who live at the same address. Household size is therefore measured as a continuous variable.

Generalization of variables

If we perform the technique of generalization, or the combination of generalization and suppression, on one of the variables, the type of variable might change. For example, if we measure Age with intervals of twenty years, the continuous nature of the variable changes into a categorical one. Now, observations in the data cannot take on any number for the variable Age, but can only adopt defined interval levels.


3.3 Analysis procedure

Anonymization

First, we anonymize the data to several levels of k-anonymity, using generalization, suppression, cell swapping and adding noise. We perform these anonymization techniques on the selection of quasi-identifiers Age, Gender and Household size. To determine the level of k-anonymity, we count the occurrences of each unique combination of the quasi-identifiers: the k-anonymity level of a dataset equals the frequency of its rarest combination. If a combination appears only once in the dataset, the k-anonymity level is one; if the rarest combination appears twice, the level of k-anonymity is two. Next, we discuss in more detail how we bring each anonymization technique into practice, and we provide an example of the implementation for every technique.
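This counting step can be expressed compactly: group the records by the quasi-identifiers and take the frequency of the rarest combination. A sketch, assuming hypothetical column names:

```python
import pandas as pd

QUASI_IDENTIFIERS = ["age", "gender", "household_size"]

def k_anonymity_level(df: pd.DataFrame) -> int:
    """The k-anonymity level is the size of the smallest group of records
    sharing one combination of quasi-identifier values."""
    return int(df.groupby(QUASI_IDENTIFIERS).size().min())

df = pd.DataFrame({
    "age": [45, 45, 45, 45, 45, 98, 98, 98],
    "gender": [0, 0, 0, 0, 0, 1, 1, 1],
    "household_size": [2, 2, 3, 3, 3, 3, 3, 2],
})
print(k_anonymity_level(df))  # 1: the combination (98, 1, 2) occurs only once
```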

Generalization

In order to increase the level of k-anonymity through generalization, we need to generalize the quasi-identifiers. As the generalization options differ per quasi-identifier, Table 5 shows the options we use in our research. The only option for the quasi-identifier Gender is to generalize it to Person, as this is a binary variable. For the ratio variables Age and Household size we use intervals. As we want to limit the information loss, we first test small intervals. We adopt the method of trial and error to find a combination of variables that has the desired level of k-anonymity. If we are not able to achieve the desired level of k-anonymity, we aim to reach a level as close as possible to the desired level.

Table 5: Options for generalizing the quasi-identifiers Age, Gender and Household size

Age             Gender          Household size
Actual age      Female - Male   Actual household size
Interval of 5   Person          Interval of 2
Interval of 10                  Interval of 3
Interval of 15                  Do not disclose household size

To clarify how we apply generalization to our dataset, we provide Tables 6a and 6b. Table 6a represents the situation before generalization and Table 6b the situation after. In Table 6a, one record is uniquely distinguishable, and therefore the level of k-anonymity is equal to one. The uniquely distinguishable record is the last record, with Age 98, Gender 1 and Household size 2. To increase the level of k-anonymity, we generalize the variable Household size and change the precise household size into intervals of three, displaying household size as 1-3 or 4-6. After performing generalization, the record that was uniquely distinguishable in Table 6a is no longer distinguishable. By applying generalization, the level of k-anonymity in Table 6b increases to three.

Table 6a & 6b: Example before (a) and after (b) performing generalization

Before (a)          After (b)
Age  Gender  HS*    Age  Gender  HS*
45   0       2      45   0       1-3
45   0       2      45   0       1-3
45   0       3      45   0       1-3
45   0       3      45   0       1-3
45   0       3      45   0       1-3
98   1       3      98   1       1-3
98   1       3      98   1       1-3
98   1       2      98   1       1-3

* Household size

Suppression

We bring suppression into practice by suppressing the observations that do not fulfill the requirement of the desired k-anonymity level. So, if the desired k-anonymity level is two, we suppress all observations whose combination of the quasi-identifiers appears only once in the dataset. Likewise, if the desired k-anonymity level increases from two to three, we suppress the observations whose combination of the quasi-identifiers appears twice.

Tables 7a and 7b illustrate this process. Table 7a contains one uniquely distinguishable record, so the level of k-anonymity is equal to one. The uniquely distinguishable record is the last record, with Age 98, Gender 1 and Household size 2. To increase the level of k-anonymity to two, we suppress the uniquely distinguishable record; Table 7b represents the result. As no record is uniquely distinguishable anymore, the level of k-anonymity increases from one to two.

Table 7a & 7b: Example before (a) and after (b) performing suppression

Before (a)          After (b)
Age  Gender  HS*    Age  Gender  HS*
45   0       2      45   0       2
45   0       2      45   0       2
45   0       3      45   0       3
45   0       3      45   0       3
45   0       3      45   0       3
98   1       3      98   1       3
98   1       3      98   1       3
98   1       2      -    -       -

* Household size
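The suppression rule described above amounts to dropping every record whose quasi-identifier combination occurs fewer than k times. A sketch, again with hypothetical column names, that reproduces Table 7b:

```python
import pandas as pd

def suppress_to_k(df: pd.DataFrame, quasi_identifiers: list, k: int) -> pd.DataFrame:
    """Remove all records whose quasi-identifier combination appears
    fewer than k times, so the remaining data is k-anonymous."""
    counts = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    return df[counts >= k]

df = pd.DataFrame({
    "age": [45, 45, 45, 45, 45, 98, 98, 98],
    "gender": [0, 0, 0, 0, 0, 1, 1, 1],
    "household_size": [2, 2, 3, 3, 3, 3, 3, 2],
})
print(suppress_to_k(df, ["age", "gender", "household_size"], k=2))
# the unique record (98, 1, 2) is suppressed, as in Table 7b
```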

Generalization + Suppression

As the approaches of generalization and suppression are often complementary, we also perform both approaches together in order to measure the effect on data utility. We use generalization as the main approach to achieve the desired level of k-anonymity, applied as explained earlier in this chapter. However, if it is more beneficial to use suppression instead of generalization, we use suppression. We measure whether suppression is more beneficial than generalization by analyzing the number of generalization steps we would have to take to achieve the desired level of k-anonymity. Tables 8a to 8d provide examples to clarify what we mean by the number of steps.

In the case of Tables 8a and 8b, one needs to take only one step of generalization: the variable Household size has to change into intervals of two. In the example of Tables 8c and 8d, one needs to take multiple generalization steps to increase the level of k-anonymity of Table 8c. The last record of Table 8c, with Age 98, Gender 1 and Household size 6, is unique; therefore, the level of k-anonymity of Table 8c is one. With solely the implementation of generalization, one would need to generalize Age to a level at which age is not disclosed at all, generalize Household size to a level at which household size is not disclosed at all, and generalize the variable Gender to Person. As a result, a significant amount of information would be lost due to the many steps of generalization, whereas suppressing the single unique record loses only one observation.

Table 8a, 8b, 8c and 8d: Examples when one prefers generalization (a & b) or suppression (c & d)

Generalization (a)  Generalization (b)
Age  Gender  HS*    Age  Gender  HS*
45   0       1      45   0       1-2
45   0       1      45   0       1-2
45   0       3      45   0       3-4
45   0       3      45   0       3-4
45   0       3      45   0       3-4
98   1       5      98   1       5-6
98   1       5      98   1       5-6
98   1       6      98   1       5-6

Suppression (c)     Suppression (d)
Age  Gender  HS*    Age  Gender  HS*
45   0       1      45   0       1
45   0       1      45   0       1
45   0       1      45   0       1
45   0       1      45   0       1
45   0       1      45   0       1
45   0       1      45   0       1
45   0       1      45   0       1
98   1       6      -    -       -

* Household size

Cell swapping

For cell swapping, we swap values within a column in order to improve the level of k-anonymity. We focus mainly on the quasi-identifier Household size as the variable in which swaps take place. We choose Household size because this variable, in contrast to the other two quasi-identifiers, has a controllable level of variety while still having enough variety to swap: Household size ranges from one to six, so we can perform the cell swaps in a controllable way. If the level of k-anonymity is still insufficient after cell swapping on Household size, we also perform cell swapping on Age or Gender.

In order to apply cell swapping, we first detect the uniquely distinguishable records. After that, we find records that are suitable to swap with. As there is no software that supports this process, we perform cell swapping by trial and error. Therefore, we expect cell swapping to be the most time-consuming method to apply.

Table 9a & 9b: Example before (a) and after (b) performing cell swapping

Before (a)          After (b)
Age  Gender  HS*    Age  Gender  HS*
45   0       2      45   0       2
45   0       2      45   0       2
45   0       3      45   0       2
45   0       3      45   0       3
45   0       3      45   0       3
98   1       3      98   1       3
98   1       3      98   1       3
98   1       2      98   1       3

* Household size
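Although we perform the swaps by trial and error, the search can be sketched programmatically. The following illustration (ours, not an established routine) tries pairwise swaps of Household size until the k-anonymity level improves; because values are only exchanged, the column's marginal distribution is preserved:

```python
import pandas as pd

QIS = ["age", "gender", "household_size"]

def k_level(df):
    return int(df.groupby(QIS).size().min())

def swap_once(df: pd.DataFrame, column: str = "household_size") -> pd.DataFrame:
    """Try swapping `column` between two records; keep the first swap
    that raises the k-anonymity level."""
    base = k_level(df)
    for i in df.index:
        for j in df.index:
            if i >= j:
                continue
            trial = df.copy()
            trial.loc[[i, j], column] = trial.loc[[j, i], column].to_numpy()
            if k_level(trial) > base:
                return trial
    return df  # no improving swap found

df = pd.DataFrame({
    "age": [45, 45, 45, 45, 45, 98, 98, 98],
    "gender": [0, 0, 0, 0, 0, 1, 1, 1],
    "household_size": [2, 2, 3, 3, 3, 3, 3, 2],
})
print(k_level(swap_once(df)))  # 2 after one swap, as in Table 9b
```

The exhaustive pairwise search mirrors the trial-and-error procedure described above and illustrates why cell swapping becomes time-consuming on large datasets.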

Adding Noise

Figure 4: Graphical illustration of the case in which the variable Gender changes from man to woman and vice versa

We compose four datasets with noise. The first dataset has noise only in the variable Age. In the second dataset we add noise only to the variable Household size. The third dataset contains noise only in the variable Gender. Lastly, in the fourth dataset we add noise to all variables.

Table 10a & 10b: Example of adding noise to the variables Age (a) and Gender (b)

Age (a)                           Gender (b)
Age  Random  Age with noise      Gender  Random  Gender with noise
45   0.568   45.568              0       0.395   0
45   1.292   46.292              0       -0.254  0
45   2.564   47.564              0       4.064   1
45   -1.494  43.505              0       -1.095  0
45   -0.432  44.567              0       -1.529  0
98   -3.415  94.584              1       0.521   1
98   0.906   98.906              1       1.836   1
98   -2.630  95.369              1       -2.630  0
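A sketch of the value distortion step behind Tables 10a and 10b: Gaussian noise with µ = 0 is added to each value, and for the binary variable Gender the noisy value is rounded and clipped back to {0, 1}. The rounding rule is our reading of Table 10b rather than a prescribed procedure, and σ = 1 follows the standard normal distribution mentioned in section 4.2:

```python
import numpy as np

rng = np.random.default_rng(0)

def distort_continuous(values, sigma=1.0):
    """Value distortion: return u_i + v_i with v_i ~ N(0, sigma^2)."""
    return values + rng.normal(0.0, sigma, size=len(values))

def distort_binary(values, sigma=1.0):
    """Add Gaussian noise, then round and clip back to {0, 1},
    so some binary values flip as in Table 10b."""
    noisy = values + rng.normal(0.0, sigma, size=len(values))
    return np.clip(np.round(noisy), 0, 1).astype(int)

age = np.array([45, 45, 45, 45, 45, 98, 98, 98])
gender = np.array([0, 0, 0, 0, 0, 1, 1, 1])
print(distort_continuous(age))  # perturbed ages, e.g. 45.13, 44.87, ...
print(distort_binary(gender))   # some genders flip, as in Table 10b
```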

Ultimately, we compose 21 different datasets. Each dataset has a specific combination of a performed technique and a level of anonymity. Table 11 shows an overview of the datasets we compose. As a base for every technique or combination of techniques we start with Dataset 1, the main dataset, with a k-anonymity level of k = 1. We decide to study the levels of k-anonymity from one to five. Referring to Figure 2 in section 2.3, it is relatively simple for firms to increase privacy for low-dimensional data; it is therefore interesting for firms to gain more knowledge on these relatively small steps of increasing privacy.

Table 11: Anonymization procedure of the data

        Generalization  Suppression  Generalization + Suppression  Cell swapping  Adding noise
k = 1   Data set 1      Data set 1   Data set 1                    Data set 1     Data set 1
k = 2   Data set 2      Data set 6   Data set 10                   Data set 14    Data set 18*
k = 3   Data set 3      Data set 7   Data set 11                   Data set 15    Data set 19*
k = 4   Data set 4      Data set 8   Data set 12                   Data set 16    Data set 20*
k = 5   Data set 5      Data set 9   Data set 13                   Data set 17    Data set 21*

* These datasets have a k-anonymity level of 1 due to the generation of unique values

Predictive quality

Where the top-decile lift (TDL) focuses on the predictive performance for the customers most likely to churn, the Gini coefficient gives us more information regarding the overall performance of a model. One calculates the Gini coefficient by dividing the area between the cumulative curve and the 45-degree line by the total area under the 45-degree line (Blattberg et al., 2008, p. 319). The Gini coefficient is always below one, and the higher the Gini coefficient, the better the performance of the model.
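A sketch of how both measures can be computed from a vector of churn flags and predicted scores; the TDL is computed as the churn rate among the top 10% scored customers relative to the overall churn rate, and the Gini coefficient follows the area-based definition above. The data and helper names are hypothetical:

```python
import numpy as np

def top_decile_lift(y_true, y_score):
    """TDL: churn rate in the top 10% highest-scored customers
    divided by the overall churn rate."""
    order = np.argsort(-y_score)
    top = y_true[order][: max(1, len(y_true) // 10)]
    return top.mean() / y_true.mean()

def gini_coefficient(y_true, y_score):
    """Gini from the cumulative gains curve: area between the curve and
    the 45-degree line, divided by the area under that line (0.5)."""
    order = np.argsort(-y_score)
    cum_gains = np.cumsum(y_true[order]) / y_true.sum()
    x = np.arange(1, len(y_true) + 1) / len(y_true)
    return (np.trapz(cum_gains, x) - 0.5) / 0.5

rng = np.random.default_rng(1)
y_true = rng.binomial(1, 0.1, size=1000)          # hypothetical churn flags
y_score = y_true * 0.3 + rng.uniform(size=1000)   # hypothetical model scores
print(top_decile_lift(y_true, y_score), gini_coefficient(y_true, y_score))
```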

3.4 Assumptions

Multicollinearity is a condition one needs to control for when performing a logistic regression. Multicollinearity occurs when there is a high correlation between predictor variables (Leeflang et al., 2015), which makes it difficult to determine which variable causes an effect. Consequently, if one wants to explain the dependent variable, it is essential to measure the effects of the independent variables independently from each other.

As we study quasi-identifiers that are relatively unrelated to each other, we do not expect to face a high degree of multicollinearity. However, the variance inflation factor (VIF) gives a definite answer to the question whether the model encounters multicollinearity. When estimating the model, we calculate the VIF-scores for the predictor variables.
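A sketch of this check using the variance_inflation_factor function from statsmodels; the data is simulated, and the common rule of thumb that VIF values above ten signal problematic collinearity is an assumption, not a claim from the thesis:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(2)
X = pd.DataFrame({
    "age": rng.integers(18, 99, size=500),
    "gender": rng.integers(0, 2, size=500),
    "household_size": rng.integers(1, 7, size=500),
})

# VIF per predictor; values above ~10 are commonly taken to signal
# problematic collinearity.
X_c = add_constant(X)
vifs = {col: variance_inflation_factor(X_c.values, i)
        for i, col in enumerate(X_c.columns) if col != "const"}
print(vifs)
```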

4. Results

This results chapter starts off with a section that describes how we prepare the data. In the second section we discuss descriptive statistics in order to present an extensive overview of the datasets. Furthermore, we analyze descriptive statistics to determine differences between datasets before we run the actual statistical tests. Section three describes the results of the logistic regression and tests the corresponding assumptions. Finally, in section four we discuss the model performance.

4.1 Data preparation

The data preparation consists of four parts: (i) classification of the variables, (ii) checking for missing data, (iii) checking for outliers and (iv) composing the datasets. First, we discuss the classification of the variables. Since the input is a text file, all variables are initially qualified as numeric. We restructure the variable Gender as a categorical variable, as it consists of several levels indicated by a number but is not truly numeric. The remaining variables, Age and Household size, keep a numeric classification.

Secondly, we check the data for missing values. We do not find any missing data. Therefore, no modifications take place regarding missing values.

Thirdly, we perform an analysis to determine any outliers. There are two types of outliers present in the data: (i) observations with an age so high that it is physically impossible and (ii) observations in which an extraordinarily high value appears (9,999). We handle both types of outliers by deleting the whole observation. The dataset contains such a high number of observations that suppressing a small share of them does not harm the statistical reliability. In total, we suppress 58,723 observations out of well over a million.

Originally, our plan was to compose five datasets for every data anonymization technique. For most techniques we succeeded; for generalization, however, it was impossible. After many attempts to compose datasets with k-anonymity levels of four and five, we changed plans and composed a dataset with a k-anonymity level of twelve and a dataset with a k-anonymity level of 174, the levels closest to the ones we desired. To clarify this process, we use Table 12. In dataset 3, the desired and actual level of k-anonymity are both equal to three. Dataset 3 includes Age with intervals of twenty, Household size with intervals of three and Gender in its original form; Age and Household size thus change to generalized forms of the original variables. To compose datasets with k-anonymity levels of four and five, we choose to generalize Gender and Household size more intensively. We choose these two variables because generalizing Age further, to intervals of 25, would lead to a much higher k-anonymity level. By intensifying the generalization of Gender and Household size, we aim to find levels of k-anonymity that are nearest to the desired levels. Ultimately, instead of a k-anonymity level of four, we compose dataset 4 with a k-anonymity level of twelve. In addition, we build dataset 5, which has a k-anonymity level of 174 instead of the desired level of five. Although a k-anonymity level of 174 is not near the desired level of five, we consider the presence of five levels for each anonymization technique important. Therefore, we include dataset 5 in our estimations of data utility.

Table 12: Overview of variables per dataset, including the corresponding k-anonymity level

Dataset              Variable 1: Age   Variable 2: Gender  Variable 3: HS**  k-anonymity
1 - Base             Age               Gender              Household size    1
2 - Generalization   Intervals of 20   Person              Household size    2
3 - Generalization   Intervals of 20   Gender              Intervals of 3    3
4 - Generalization   Intervals of 20   Gender              Do not include    12
5 - Generalization   Intervals of 20   Person              Do not include    174
6 - Suppression      Age               Gender              Household size    2
7 - Suppression      Age               Gender              Household size    3
8 - Suppression      Age               Gender              Household size    4
9 - Suppression      Age               Gender              Household size    5
10 - Gen + Supp*     Age               Gender              Household size    2
11 - Gen + Supp*     Age               Gender              Household size    3
12 - Gen + Supp*     Intervals of 20   Gender              Household size    4
13 - Gen + Supp*     Intervals of 15   Gender              Intervals of 2    5
14 - Cell swapping   Age               Gender              Household size    2
15 - Cell swapping   Age               Gender              Household size    3
16 - Cell swapping   Age               Gender              Household size    4
17 - Cell swapping   Age               Gender              Household size    5
18 - Age Noise       Age               Gender              Household size    1
19 - HS Noise**      Age               Gender              Household size    1
20 - Gender Noise    Age               Gender              Household size    1
21 - All Noise       Age               Gender              Household size    1

* Generalization + Suppression
** Household size

4.2 Descriptive statistics

The aim of our study is to compare several methods of data anonymization. We have composed multiple datasets with the use of the different anonymization techniques. Before comparing the anonymization techniques on predictive performance, we compare them by analyzing descriptive statistics, presented in Table 13. More specifically, Table 13 presents the mean and standard deviation of the continuous variables. For the categorical variable Gender, Table 13 presents the distribution between men and women in percentages.

Although one would expect the mean and standard deviation to change when suppressing observations, this does not happen. As the main explanation, we point to the richness of the data in terms of the number of observations. If the number of suppressed observations is negligible compared to the total number of observations, suppression does not harm the underlying statistics. To find out whether the underlying statistics are harmed, the results of the logistic regression can be of assistance.

Regarding the categorical variable Gender, Table 13 shows the distribution of men and women in each dataset. As for the continuous variables, we do not observe meaningful differences between datasets, and we assume the reason is the same: the changes in the number of men or women are small relative to the total number of observations, so the distribution in percentages stays the same. There are two exceptions: the distribution of men and women changes in dataset 20 and dataset 21, caused by the noise-induced switches from man to woman and vice versa. Because a standard normal distribution is used to generate the random noise, the adjustment remains modest; the percentages of men and women each change by only 0.10%. We consider these changes small and therefore, based on the descriptive statistics, expect little change in the utility of the data.
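Our data section describes the noise mechanism only in outline; the sketch below shows one plausible implementation, in which standard normal noise is added to the continuous variables and a large noise draw triggers a gender flip. The flip threshold and the gender labels are assumptions, with the threshold chosen so that only a small share of records changes category.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

def add_noise(df, flip_threshold=3.0):
    """Perturb the quasi-identifiers with standard normal noise.
    Continuous variables receive additive noise (rounded back to whole
    numbers); Gender flips only when |noise| exceeds the threshold,
    which happens for roughly 0.3% of the records at threshold 3.0."""
    out = df.copy()
    n = len(out)
    out["Age"] = (out["Age"] + rng.standard_normal(n)).round().clip(lower=0)
    out["Household size"] = (out["Household size"]
                             + rng.standard_normal(n)).round().clip(lower=1)
    flip = np.abs(rng.standard_normal(n)) > flip_threshold
    out.loc[flip, "Gender"] = out.loc[flip, "Gender"].map(
        {"Man": "Woman", "Woman": "Man"})   # labels are assumptions
    return out
```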

Table 13: Descriptive statistics

Dataset               Observations   Age M   Age SD   % Man    % Woman   HS M   HS SD
1  - Base             1,127,715      40.31   23.67    47.48%   52.52%    2.87   1.54
2  - Generalization   1,127,715      -       -        47.48%   52.52%    2.87   1.54
3  - Generalization   1,127,715      -       -        47.48%   52.52%    2.87   1.54
4  - Generalization   1,127,715      -       -        47.48%   52.52%    2.87   1.54
5  - Generalization   1,127,715      -       -        47.48%   52.52%    2.87   1.54
6  - Suppression      1,127,687      40.31   23.67    47.48%   52.52%    2.87   1.54
7  - Suppression      1,127,667      40.31   23.67    47.48%   52.52%    2.87   1.54
8  - Suppression      1,127,652      40.31   23.67    47.48%   52.52%    2.87   1.54
9  - Suppression      1,127,628      40.31   23.67    47.48%   52.52%    2.87   1.54
10 - Gen + Supp*      1,127,687      40.31   23.67    47.48%   52.52%    2.87   1.54
11 - Gen + Supp*      1,127,667      40.31   23.67    47.48%   52.52%    2.87   1.54
12 - Gen + Supp*      1,127,667      -       -        47.48%   52.52%    2.87   1.54
13 - Gen + Supp*      1,127,685      -       -        47.48%   52.52%    -      -
14 - Cell swapping    1,127,712      40.31   23.67    47.48%   52.52%    2.87   1.54
15 - Cell swapping    1,127,710      40.31   23.67    47.48%   52.52%    2.87   1.54
16 - Cell swapping    1,127,710      40.31   23.67    47.48%   52.52%    2.87   1.54
17 - Cell swapping    1,127,710      40.31   23.67    47.48%   52.52%    2.87   1.54
18 - Age Noise        1,127,710      40.31   23.67    47.48%   52.52%    2.87   1.54
19 - HS Noise**       1,127,710      40.31   23.67    47.48%   52.52%    2.87   1.54
20 - Gender Noise     1,127,710      40.31   23.67    47.58%   52.42%    2.87   1.54
21 - All Noise        1,127,710      40.31   23.67    47.58%   52.42%    2.87   1.54

* Generalization + Suppression   ** Household size


The remaining statistics in need of analysis are those of the generalized variables. We therefore compose a second table of descriptive statistics, Table 14, which reports in which model we use each generalized variable, which categories the variable contains, and the frequency and percentage distribution of each category. The only remarkable aspect of Table 14 is the relatively small number of observations older than 90 or 100, for the variables Age in intervals of 15 and Age in intervals of 20 respectively. These groups are very small compared to the other groups within the variable. Apart from the small size of these groups, no other remarkable observations emerge from Table 14. We still expect a decrease in data utility after applying the data anonymization techniques.

Table 14: Descriptive statistics of the occasionally used categorical variables

Variable                        Model            Categories   Frequency   Percentage
Age intervals of 15             13               0-15         203,647     18.06%
                                                 15-30        196,360     17.41%
                                                 30-45        233,791     20.73%
                                                 45-60        225,567     20.00%
                                                 60-75        168,013     14.90%
                                                 75-90        91,288      8.10%
                                                 90-105       9,014       0.80%
                                                 105-120      5           0.00%
Age intervals of 20             2, 3, 4, 5, 12   0-20         267,520     23.72%
                                                 20-40        280,191     24.85%
                                                 40-60        311,654     27.64%
                                                 60-80        209,850     18.61%
                                                 80-100       58,326      5.17%
                                                 100-120      174         0.02%
Household size intervals of 2   13               0-2          548,842     48.67%
                                                 2-4          348,049     30.86%
                                                 4-6          230,794     20.47%
Household size intervals of 3   3                0-3          719,022     63.76%
                                                 3-6          408,693     36.24%

4.3 Logistic regression

To compare the different datasets, we estimate the following logistic regression equation:

$$P(\text{Churn}_i) = \frac{1}{1 + e^{-z_i}}$$

In this equation, P(Churn_i) is the probability that customer i churns, and z_i is the combination of the three independent variables and the control variables. Next, we discuss how we specify z_i.

The ways in which we measure the three independent variables differ for some datasets. In some of our models we do not include Age as an exact number but in intervals of fifteen or twenty, and for Household size we occasionally use intervals of two or three. The specification of z_i can therefore deviate slightly from the one we describe now.

We specify z_i as follows:

$$z_i = \beta_0 + \beta_1\,\text{Age}_i + \beta_2\,\text{Gender}_i + \beta_3\,\text{Household size}_i + \varepsilon_i$$

where:
Age_i = age of customer i
Gender_i = gender of customer i
Household size_i = household size of customer i
ε_i = error term for customer i
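A minimal estimation sketch, assuming statsmodels and a binary churn indicator named Churn (the column name and coding are assumptions):

```python
import pandas as pd
import statsmodels.api as sm

def fit_churn_logit(df, predictors=("Age", "Gender", "Household size")):
    """Estimate the binary logit P(Churn_i) = 1 / (1 + exp(-z_i)).
    Categorical predictors (Gender, or generalized Age intervals) are
    dummy-coded; the added constant captures beta_0."""
    X = pd.get_dummies(df[list(predictors)], drop_first=True).astype(float)
    X = sm.add_constant(X)
    y = df["Churn"].astype(float)    # 1 = churned, 0 = retained (assumed coding)
    return sm.Logit(y, X).fit(disp=0)
```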

Assumptions

Before analyzing the predictive performance of the logit models, we check whether the assumptions of a logit model are met. First, we check for multicollinearity by calculating the VIF scores of each logit model. After analyzing the VIF scores, we conclude that multicollinearity is not present in any of the logit models; we refer to Appendix A for the maximum VIF score of each model.

Secondly, we check the assumption regarding sample size, taking a minimum of ten observations per variable as the rule of thumb. As all our datasets contain more than one million observations, this assumption is easily satisfied; the column Observations of Table 13 displays the number of observations per dataset.
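The multicollinearity check can be reproduced with the variance inflation factor from statsmodels; a sketch under the same assumptions as the estimation code above:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def max_vif(df, predictors=("Age", "Gender", "Household size")):
    """Return the largest VIF among the predictors; the constant term
    (column 0) is excluded from the check."""
    X = sm.add_constant(pd.get_dummies(df[list(predictors)],
                                       drop_first=True).astype(float))
    return max(variance_inflation_factor(X.values, i)
               for i in range(1, X.shape[1]))
```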

4.4 Model performance

In order to study the effect of applying the data anonymization techniques at different levels of k-anonymity, we compare the composed logit models with the base model and with each other on two performance measures: the TDL and the Gini coefficient. Table 15 shows the TDL and Gini coefficient per logit model. In addition, the column TDL performance presents the percentage difference in model performance compared to model 1 according to the TDL, and the column Gini performance does the same according to the Gini coefficient. Model 1, the base model, has a TDL of 1.81 and a Gini coefficient of 0.40.
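Both performance measures can be computed directly from the predicted churn probabilities. The sketch below assumes scikit-learn for the area under the ROC curve, from which the Gini coefficient follows as 2·AUC − 1; the TDL is the churn rate in the top decile of predicted probabilities divided by the overall churn rate.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def top_decile_lift(y_true, y_prob):
    """Churn rate among the 10% of customers with the highest predicted
    churn probability, divided by the overall churn rate."""
    y_true = np.asarray(y_true)
    order = np.argsort(y_prob)[::-1]
    top = y_true[order][: max(1, len(y_true) // 10)]
    return top.mean() / y_true.mean()

def gini_coefficient(y_true, y_prob):
    """Gini coefficient derived from the area under the ROC curve."""
    return 2 * roc_auc_score(y_true, y_prob) - 1
```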

Generalization

Surprisingly, the performance of model 2 does not decrease: its Gini coefficient increases by 5.00% to 0.42. For the other models in which we apply generalization to achieve a higher level of k-anonymity, model performance decreases as we expect. For model 3, model 4 and model 5 the TDL scores decrease by 4.97%, 17.13% and 23.20% respectively compared to the TDL of model 1. In addition, the Gini coefficient of model 3 remains at the level of the base model, while the Gini coefficients of model 4 and model 5 both decrease by 5.00% compared to model 1. Apart from the surprising performance of model 2, the performance of models 3, 4 and 5 shows the expected downward trend in model performance as the level of k-anonymity increases.

Suppression

The implementation of generalization had a substantial effect on the TDL of model 4 and model 5, with a maximum decrease of 23.20%. No similar trend is present for suppression: its application has little effect on model performance. The TDL varies only slightly across models 6 through 9, with model 7 as the worst performing model, scoring 2.76% below the base model. All Gini coefficients remain at the value of the base model. The number of suppressed observations in model 9, the model with the highest k-anonymity level of five, equals 87; relative to the total number of observations in the original dataset, this is a suppression of only 0.008% of the data.

Generalization + Suppression

If we combine generalization and suppression, we see results similar to those of the models in which we use only one of the two methods. To achieve the desired levels of k-anonymity in model 10 and model 11 we applied suppression, and we therefore report results identical to those of model 6 and model 7: the performance of model 10 and model 11 is slightly worse than that of the base model, but the decrease is visible only in the TDL and amounts to 1.10% for model 10 and 2.76% for model 11. Comparable to the surprising increase in performance of model 2, model 12 and model 13 also show an increase in model performance after generalization. Model 12 performs 4.42% better on the TDL and 2.50% better on the Gini coefficient. Model 13 performs worse than model 12 on the TDL, with an increase relative to the base model of 1.10%; on the other hand, its Gini coefficient of 0.43 is superior to that of model 12, an increase of 7.50% compared to the Gini coefficient of model 1.

Cell swapping

The models in which we apply cell swapping present a comparable situation. For models 14 through 17 we detect a downward trend as well, and similar to the TDL, the trend levels off at a certain point. In the case of the Gini coefficient, it starts at 0.40 in the base model and decreases to 0.39; the maximum decrease is therefore 2.50%, which is small.
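Cell swapping exchanges attribute values between randomly paired records, so that the marginal distributions stay intact while individual records no longer match their original quasi-identifier values. A minimal sketch; the share of swapped records is an assumption, in practice tuned until the desired k-anonymity level is reached.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

def swap_cells(df, column, share=0.01):
    """Randomly pair a share of the records and swap their values in the
    given column; the column's marginal distribution is unchanged."""
    out = df.copy()
    n_swaps = int(len(out) * share) // 2 * 2      # an even number of records
    idx = rng.choice(out.index.to_numpy(), size=n_swaps, replace=False)
    a, b = idx[: n_swaps // 2], idx[n_swaps // 2:]
    out.loc[a, column], out.loc[b, column] = (out.loc[b, column].to_numpy(),
                                              out.loc[a, column].to_numpy())
    return out
```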

Adding noise

Finally, we report the results of the models to which we added noise. Models 18 through 21 show varying results. In model 18 and model 20 the addition of noise has little effect: only the TDL decreases slightly, from 1.81 to at most 1.78 in model 18. Adding noise to the variable Household size in model 19 has a stronger effect and results in a 12.71% decrease in the TDL and a 2.50% decrease in the Gini coefficient. This decrease probably drives the performance of model 21, in which we add noise to Age, Gender and Household size; model 21 shows results identical to model 19, with a TDL of 1.58 and a Gini coefficient of 0.39.

Table 15: Top decile lift and Gini coefficient per logit model

Logit-model TDL Gini TDL performance Gini performance
