
Bachelor Informatica

Altering electric vehicle charging database for educational use

Maikel van der Panne

June 8, 2016

Informatica
Universiteit van Amsterdam


Abstract

In recent years, the adoption of electric vehicles in the Netherlands has increased significantly. Along with this, the need for more large-scale research into electric vehicles and charging infrastructure arises. At the Amsterdam University of Applied Sciences (HvA), the IDO-Laad project has assembled a database of charging sessions that could be used by students for data analysis at a large scale. The privacy of the electric vehicle users and the sheer size of the database are obstacles, however. In this thesis, the possibilities of data anonymization and data alteration are explored in order to propose and implement a software solution for anonymizing and altering the IDO-Laad database. As a result, teachers of the HvA can conveniently create tailor-made databases for use in their courses. The problem at hand is implementing this in a structural, failure-proof way. The application discussed in this thesis tackles the anonymization problem by applying k-anonymization, which guarantees anonymity while keeping the information loss to a minimum. Requirement analysis has been performed to match anonymization requirements and end-user expectations. The result is an application that can anonymize the IDO-Laad database with decent runtime and offers features that help teachers create databases for their courses.


Contents

1 Introduction
    1.1 Background on the IDO-Laad project
    1.2 Research question
2 Theoretical background
    2.1 k-anonymization
        2.1.1 Methods to implement k-anonymization
    2.2 Perturbation & Permutation
    2.3 Information loss
        2.3.1 Global Certainty Penalty (GCP)
        2.3.2 Query estimation accuracy
3 Analysis of the IDO-Laad database
    3.1 Data of interest
    3.2 Anonymization requirements
    3.3 Classifying the data fields
4 Application design
    4.1 Target environment
    4.2 Analysis of functional requirements
    4.3 Application pipeline
5 Implementation details
    5.1 General application structure
    5.2 Database anonymization
        5.2.1 Choosing an anonymization algorithm
        5.2.2 Mondrian k-anonymity
        5.2.3 Extending Mondrian k-anonymity for categorical attributes
    5.3 Database alteration functionality
6 Experiments
    6.1 Measuring anonymization performance and quality
    6.2 Representation of the anonymized database
7 Discussion
8 Conclusion
    8.1 Future work


CHAPTER 1

Introduction

In recent years, the adoption and use of electric vehicles have been much debated subjects. The main benefits of electric vehicles compared to their gas-powered equivalents lie in the environmental improvements that could be achieved by eliminating the toxic emissions resulting from the use of fossil fuels in gas-powered cars. Since the development of electric vehicles is still in an early stage, a number of drawbacks are present as well. Examples include range anxiety, which is the fear of running out of charge before reaching the destination, long recharging times, high purchase prices and the limited availability of charging infrastructure [Graham-Rowe et al., 2012].

Since 2011, the adoption of electric vehicles in the Netherlands has seen major improvements. The municipalities of the Dutch G4 (Amsterdam, Rotterdam, Den Haag and Utrecht) have invested in the roll-out of charging infrastructure and introduced subsidies to convince residents to make the switch to electric driving [IDO-Laad, 2016, van den Hoed et al., 2013]. In order to conduct research on the development of electric driving in the Netherlands, the IDO-Laad project was established in 2015. It is a research group located at the Amsterdam University of Applied Sciences (HvA). The graduation project that will be discussed in this thesis is part of the aforementioned project. The background on the IDO-Laad project and the position of my graduation project within it will be discussed below.

1.1 Background on the IDO-Laad project

IDO-Laad, an abbreviation of Intelligent Data-driven Optimization Charging infrastructure, is a project that conducts research and develops tools to improve the roll-out of charging infrastructure. The primary goals of this project are to make the roll-out more efficient and cost-effective. These developments all contribute to the adoption of electric vehicles in favor of gas vehicles [IDO-Laad, 2016]. One of the results of this project so far is the formation of a database containing charging sessions of electric cars for a multitude of cities in the Netherlands, including the Dutch G4. Each record in the database represents a charging session. Among other things, a charging session includes information about the charging location, when the session occurred, the duration of the session and the amount of energy charged. A full overview of the available information will be discussed in Section 3.1. The total number of transactions recorded in this database surpasses two million records, which makes it the largest database of its kind [van den Hoed et al., 2013].

The graduation project involves creating a software product that can alter the IDO-Laad database described above, primarily for use in educational settings and possibly for public release of (a subset of) the charging session data. More specifically, teachers of the HvA have shown interest in using this database for their courses. As of now, there are strict regulations on who can access the data, and releasing the data to a large group of students is not an option. Supplying the teachers with these data means that students will get the chance to explore a large and unique database, full of research opportunities. As a result, more research can be conducted, which in turn can result in more contributions to electric vehicle development as a whole.


The project can be divided in two parts, each requiring different types of research: anonymization and alteration of the IDO-Laad database. Anonymization of the data is necessary, as the database contains sensitive data, examples of which are the charging station owner, charging station location and the amount of energy charged. Consider, for example, the possibility of tracking down important places such as home addresses by analyzing the charging records of a certain user identifier. In order to determine the exact requirements for anonymization, the municipalities of the cities from which the data originates need to be involved. They are the data owners, and can thus specify what information can and cannot be released after anonymization.

After anonymizing the database, alteration is necessary for HvA teachers to extract the relevant data for use in their courses. This prevents teachers from having to use unnecessarily large databases, for example. The end-user needs to be able to choose their own necessities out of a number of implemented options, ensuring tailor-made databases. For both anonymization and alteration, the wishes and requirements of teachers and municipalities are determined through requirement analysis; for each part of the graduation project, a section is devoted to it. Experiments are conducted to determine the performance of the implemented software product and to quantify the amount of information lost after anonymization.

1.2 Research question

Resulting from the background and description of the graduation project, the main scientific problem at hand is anonymizing the database in a structural, failure-proof way. When exploring this problem, the trade-off between privacy and data integrity is crucial. Then, user requirements have to be taken into account to produce a software product capable of both anonymizing and altering the IDO-Laad database. Given this description, the research question is formulated as follows:

How can the IDO-Laad database be sufficiently anonymized and appropriately altered for public and educational use, taking into account multiple usage profiles?


CHAPTER 2

Theoretical background

In this chapter, the theoretical background on data anonymization is discussed. An extensive literature study was performed to get an overview of the anonymization techniques and methods used in related research articles. First, the k-anonymization technique is explained, after which the methods used to implement this anonymization technique are discussed. Then, a different branch of anonymization methods is discussed, which includes perturbation and permutation. These methods generally do not follow the k-anonymization principle. The advantages and drawbacks of these methods are assessed and explained.

2.1 k-anonymization

A frequently used technique for data anonymization is k-anonymization. This technique is not a method for anonymizing a database; it is a requirement on the anonymized database, to guarantee a certain degree of anonymization. It was first introduced by Latanya Sweeney in 2002, and is defined as follows: a k-anonymized database fulfills the requirement that each record is similar to at least k − 1 other records [Sweeney, 2002, El Emam and Dankar, 2008]. The definition of what similar means in this context will become clear after explaining k-anonymization in detail below. The definition of k-anonymization will then be rephrased in a more strict fashion, which shows how such a requirement can be formulated precisely for any database.

The first step is analyzing the contents of a generic relational database record. When looking at a database record from an anonymization perspective, there are three types of attributes (or columns) in a record: direct identifiers, quasi-identifiers and non-identifier attributes. When anonymizing a relational database, the attributes which are direct identifiers, for example a patient's name, should be discarded as a whole. Attributes that contain non-identifier values do not contribute to identifying a record, and thus can remain in the database even after anonymization. A quasi-identifier is an attribute or a combination of attributes that can uniquely identify a database record. It may also be possible to identify a database record by combining the quasi-identifier with information from external (public) sources [Sweeney, 2002, Ghinita et al., 2007]. As the released data is intended to be used by a large population, the possibility of identification through the use of external information is a real threat that needs to be taken into account.

Having defined the possible identifier types for all the attributes in a database table, the goal of k-anonymization can be further specified as follows: When applying k-anonymization, similarity is achieved over the quasi-identifier attributes. This means that in a k-anonymized database, for each record, there should be at least k − 1 other records that have equal quasi-identifier values. Furthermore, all records with the same quasi-identifier values form an equivalence class. In order to be able to fulfill the k-anonymization requirement for a certain k, the database must at least contain k records. In Figure 2.1 an example database with patient data is shown, along with its anonymized counterpart. The source of the figures and the data is [LeFevre et al., 2006].


(a) The original patient data records. (b) The anonymized patient data records.

Figure 2.1: The patient data records featured in [LeFevre et al., 2006], before and after applying k-anonymization. Parameters: k = 2 and quasi-identifier = {age, sex, zip code}.

The example shows a database with patient records, for which the quasi-identifier attributes are determined to be age, sex and zip code. “Disease” is a non-identifier attribute, and thus plays no part in the anonymization process. Furthermore, k equals 2. When evaluating the resulting anonymized database, three combinations of quasi-identifier values can be seen, each of which is present in k records. This means there are three equivalence classes which all meet the k-anonymization requirement. Thus, the database is successfully k-anonymized.
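To make the requirement concrete, the check below is a minimal sketch (with hypothetical column names mirroring Figure 2.1) of how k-anonymity can be verified: group the table by the quasi-identifier attributes and confirm that every equivalence class contains at least k records.

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    """True if every equivalence class over the quasi-identifiers has >= k records."""
    class_sizes = df.groupby(quasi_identifiers).size()
    return bool((class_sizes >= k).all())

# Toy table mirroring the anonymized patient records of Figure 2.1b.
patients = pd.DataFrame({
    "age":      ["[25-28]", "[25-28]", "[25-28]", "[25-28]"],
    "sex":      ["Male", "Male", "Female", "Female"],
    "zip_code": ["[53710-53711]", "[53710-53711]", "53712", "53712"],
    "disease":  ["Flu", "Hepatitis", "Flu", "Bronchitis"],
})
print(is_k_anonymous(patients, ["age", "sex", "zip_code"], k=2))  # True
```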

To be able to perform k-anonymization on a database, the value of k and the quasi-identifier need to be determined. The assumption is made that the data holder has the most knowledge about their data, and therefore should be able to identify the quasi-identifier themselves. They should be able to point out the attributes that might appear in external (public) information. When this is not the case, the strength of the applied anonymization is at stake [Sweeney, 2002]. A guideline for choosing k is the re-identification probability, or threshold risk, which is equal to 1/k. Choosing the value for k is a trade-off between the risk of re-identification and the amount of data loss that will occur. When k gets higher, the information loss increases along with it.

Related research suggests a k-value of at least 3, with 5 being recommended. A rough upper limit is 15; higher values of k are rarely used. Using a k-value of 2 is generally advised against, because when a person or entity included in the database identifies itself, it will know information about the other entity in its equivalence class [El Emam and Dankar, 2008, de Wolf et al., 1993]. Ultimately, it is again up to the data owner to determine which risk is acceptable for the database at hand.

2.1.1 Methods to implement k-anonymization

After having explained the process of k-anonymization and the necessary requirements and parameters, this section discusses the methods used to implement k-anonymization. The characteristics of two widely used methods will be discussed, including the advantages and drawbacks of each method.

The two most commonly used methods for k-anonymization are suppression and generalization. Suppression removes data from the original dataset as a whole, such as removing the patients' names from a patient record database. This method can be linked to the removal of identifier attributes in a database, as described in the previous section. Additionally, suppression can be applied to individual records in the database to achieve k-anonymization. Generalization replaces the original information with a set of values, in which the correct value is embedded. This way it is no longer clear what the exact value is, but there still is a sense of what it might be [Cormode and Srivastava, 2009]. Depending on the parameters of anonymization and the nature of the database, it might not always be necessary to replace the value by a range. Both methods are applied to the quasi-identifier attributes of the database. Referring back to Figure 2.1, it can be seen that generalization has been used for this anonymization process. Figure 2.1b shows that all the ages from the original patient records have been replaced by a range of ages, for example.
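A minimal sketch of both methods on a toy table (hypothetical data; for brevity, the whole table is treated as a single group when generalizing):

```python
import pandas as pd

patients = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],            # direct identifier
    "age":  [25, 27, 26, 28],                             # quasi-identifier
    "disease": ["Flu", "Hepatitis", "Flu", "Bronchitis"]  # non-identifier in Figure 2.1
})

# Suppression: drop the direct identifier column entirely.
patients = patients.drop(columns=["name"])

# Generalization: replace exact ages with the range covering the group.
patients["age"] = "[{}-{}]".format(patients["age"].min(), patients["age"].max())
print(patients)
```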


Using k-anonymization and the methods described above to implement it, a deterministic anonymization model is achieved. As mentioned in the definition of k-anonymization, the level of anonymity will be guaranteed. This means that the data owner can rely on the anonymization technique to deliver the level of anonymity requested, given the correct parameters. The most prominent downside to these methods is that the representation of the original database is not always retained; for example, a value may be transformed into a range. Therefore, analyzing the anonymized data might require a different approach. It is possible that adjustments have to be made to support the changed data types [Aggarwal, 2008].

2.2 Perturbation & Permutation

Next to the k-anonymization principle, there are other methods of anonymization as well. In this section, perturbation and permutation are discussed. Although the methods are not directly related to k-anonymization, certain principles, such as the attribute identifier types, are used in these methods as well. Furthermore, the concept of partitioning into equivalence classes is used to improve performance in certain implementations [Zhang et al., 2007].

Perturbation is a method of achieving anonymity by adding noise to sensitive attributes of the original database, generated from a known distribution (such as the Gaussian). Sensitive attributes contain intimate data about a person or entity, and are independent from the attribute types defined for k-anonymization. This method is best applicable to numerical attributes [Aggarwal, 2008, Zhang et al., 2007]. In [Li and Sarkar, 2006], a method is described to apply perturbation to both numerical and categorical attributes. It aims to anonymize a database, yet preserve the statistical distribution of the data, by applying a Bayes-based swapping procedure on the sensitive attributes. In order to support numerical attributes, these attributes are binned, after which they are treated as categorical data. Due to the binning, the amount of information loss increases.

Permutation involves data swapping of sensitive attributes in the database with the objective of removing the association between the record and its sensitive attribute(s), for example the connection between a charging session and a certain user identifier. Swapping user identifiers effectively removes this connection between the attributes. This can be done with randomly selected records, which increases the information loss significantly. Moreover, the degree of privacy guaranteed by applying permutation is limited, as there is no definitive privacy requirement connected to this method which can be verified afterwards [Cormode and Srivastava, 2009, Zhang et al., 2007]. To improve the quality of permutation-driven anonymization, k, e-anonymized databases are introduced. Like in generalization, partitions or equivalence classes are created and the records in the partitions are permuted. In order to achieve k, e-anonymity, each partition is required to have at least k different values for the sensitive attribute(s), and the range of these values should be at least e [Zhang et al., 2007]. An example of a (3, 2000)-anonymized database table is given below, in Figure 2.2.

(a) The original salary data records. (b) The anonymized salary data records.

Figure 2.2: The salary data records featured in [Zhang et al., 2007], before and after applying permutation-driven anonymization. Parameters: k = 3 and e = 2000.
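A minimal sketch of both probabilistic approaches on a hypothetical salary table (values are illustrative and the noise scale is chosen arbitrarily):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
salaries = pd.DataFrame({"employee_id": range(6),
                         "salary": [2000.0, 2500.0, 3000.0, 3500.0, 4000.0, 4500.0]})

# Perturbation: add Gaussian noise to the sensitive numerical attribute.
perturbed = salaries.copy()
perturbed["salary"] += rng.normal(loc=0.0, scale=200.0, size=len(salaries))

# Permutation: shuffle the sensitive attribute to break the link between
# each record and its original value.
permuted = salaries.copy()
permuted["salary"] = rng.permutation(permuted["salary"].to_numpy())
```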


Perturbation and permutation produce probabilistic k-anonymization models. The downside of this is that anonymity is not necessarily guaranteed. The privacy guarantees of permutation anonymization can be improved by using k, e-anonymization, but only numerical attributes are discussed. According to [Zhang et al., 2007], extending the procedures to support categorical attributes should not be difficult. The fact that these methods preserve more of the original database, such as the quasi-identifier attributes, can be seen as an advantage. However, associations between quasi-identifier attributes and sensitive attributes are removed, which can greatly impact the usability of the anonymized database. In Table 2.1 below, the four anonymization methods are summarized to give a quick overview of their properties.

Table 2.1: Overview of the anonymization methods discussed in Sections 2.1 and 2.2.

Anonymization method | Action taken on data | Anonymization model
Generalization | Replacing information from the original database by a set of values, or a range, in which the original value is embedded. | Deterministic
Suppression | Complete removal of attributes or records from the database. Commonly done for identifier attributes. | Deterministic
Perturbation | Alteration of values. Noise can be added to numerical data and a swapping algorithm is applied to categorical data to achieve anonymity. | Probabilistic
Permutation | Swapping of sensitive attributes in the database, in order to remove associations between the record and its sensitive attribute(s). | Probabilistic

2.3 Information loss

An important aspect of anonymization is to quantify the amount of information that is lost after anonymizing a database. This measure can be used to analyze the effectiveness and quality of an anonymization algorithm. If applying an anonymization algorithm results in a lot of information loss, it becomes more difficult to extract meaningful statistics from the resulting data. It is therefore important that the information loss is kept at a minimum, while guaranteeing the privacy constraints.

2.3.1 Global Certainty Penalty (GCP)

An information loss measure commonly used in combination with k-anonymization of tabular data is the Global Certainty Penalty (GCP). This measure was first introduced by [Ghinita et al., 2007], and is a measure for the information loss over an entire database table. Only the quasi-identifier attributes are included in this calculation, as anonymization is achieved over these attributes. It uses the Normalized Certainty Penalty (NCP) to measure the information loss per attribute of an equivalence class present in the anonymized database [Xu et al., 2006]. For numeric attributes, the NCP is defined as follows, in which A_num is the numeric attribute and G is the equivalence class:

$$ NCP_{A_{num}}(G) = \frac{\max_{G} A_{num} - \min_{G} A_{num}}{\max A_{num} - \min A_{num}} \tag{2.1} $$

For numeric attributes, the NCP divides the range of values of A_num present in the equivalence class by the range of the attribute over the entire table. A different approach is taken for categorical data. It is assumed that the categorical attributes are connected to a user-defined hierarchical tree, called a Domain Generalization Hierarchy (DGH). An example of such a tree can be seen in Figure 2.3. The nodes in the tree describe the different generalization possibilities. All possible attribute values are represented by the leaves of the tree.

Given a generalized categorical attribute of an equivalence class, consisting of the values {n_0, n_1, n_2, ...}, we need to find u, the closest common ancestor of all values in the generalized set. The NCP is then calculated by counting the number of categorical values that are descendants of u, denoted size(u). This is then divided by the total number of categories for the attribute, |A_cat|. Thus, this is defined as follows:

$$ NCP_{A_{cat}}(G) = \begin{cases} 0 & \text{if } size(u) = 1 \\ size(u)/|A_{cat}| & \text{otherwise} \end{cases} \tag{2.2} $$

For example, given a generalized attribute {Germany, Greece} and the DGH shown below, the closest common ancestor is Europe. This then gives NCP_{A_cat} = 3/7.

Worldwide
├── Europe: Germany, France, Greece
├── America: USA, Canada
└── Asia: China, Japan

Figure 2.3: Example of a Domain Generalization Hierarchy (DGH), which can be used to generalize and calculate the information loss of categorical attributes.

An alternative to DGH is the use of Natural Domain Generalization Hierarchies (NDGH). Categorical attributes of an equivalence class are then generalized into the set of categorical values present within the equivalence class. There is no predefined set of generalization values, such as in Figure 2.3. Every combination of categorical attribute values is a valid generalization. Thus, size(u) can be replaced by |A^G_cat|, denoting the number of unique categorical values in the equivalence class. This approach results in less information loss compared to DGH, although the notation of a generalized attribute could be seen as less intuitive [Nergiz et al., 2010].

After having defined the NCP for both types of attributes, the NCP can now be calculated for all quasi-identifier attributes in equivalence class G. This gives the following measure for the NCP over an equivalence class, shown in Equation 2.3. If applicable, a weight can be included for each quasi-identifier attribute, denoted by w_i, with Σ w_i = 1. Moreover, d denotes the number of quasi-identifier attributes.

$$ NCP(G) = \sum_{i=1}^{d} w_i \cdot NCP_{A_i}(G) \tag{2.3} $$

Now, applying the NCP over all equivalence classes yields the GCP. Let P denote the set of all equivalence classes present in the anonymized database. The GCP is then defined as shown in Equation 2.4 below, in which |G| denotes the number of records in the equivalence class and N the total number of records in the database. The value of the GCP is always in the range [0.0, 1.0].

$$ GCP(\mathcal{P}) = \frac{\sum_{G \in \mathcal{P}} |G| \cdot NCP(G)}{N} \tag{2.4} $$
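A minimal sketch of Equations 2.1 through 2.4, assuming the original attribute values are still available alongside an equivalence-class label produced by the anonymizer, and using the NDGH interpretation for categorical attributes (helper names are hypothetical):

```python
import pandas as pd

def ncp_numeric(group, full):
    """Equation 2.1: class range divided by the attribute's full range."""
    return (group.max() - group.min()) / (full.max() - full.min())

def ncp_categorical(group, full):
    """Equation 2.2 under NDGH: unique values in the class / total categories."""
    unique = group.nunique()
    return 0.0 if unique == 1 else unique / full.nunique()

def gcp(df, class_col, numeric_qis, categorical_qis, weights):
    """Equations 2.3 and 2.4: weighted NCP per class, averaged over all records."""
    total = 0.0
    for _, group in df.groupby(class_col):
        ncp = sum(weights[a] * ncp_numeric(group[a], df[a]) for a in numeric_qis)
        ncp += sum(weights[a] * ncp_categorical(group[a], df[a]) for a in categorical_qis)
        total += len(group) * ncp
    return total / len(df)
```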


2.3.2 Query estimation accuracy

In order to quantify the information loss for perturbation- and permutation-based anonymization methods, the average error in answering (aggregate) database queries is used. This measure can be applied to k-anonymized databases as well to provide a measure for comparing both types of anonymization [Zhang et al., 2007, Aggarwal, 2008].

For this method, a number of queries have to be defined, the result of which can be calculated for the anonymized database table. The queries should contain calculations using the sensitive attributes of the database, as these are affected by the anonymization procedure. This is necessary to compare the results of the original and the anonymized database properly. Numerical sensitive attributes are most easily used to calculate the error of the query result. The query estimation error can be defined as follows:

$$ error = \frac{|R - R'|}{R} \cdot 100 $$

in which R is the result of the query executed on the original database, and R' is the query result when executed on the anonymized database. In order to clarify how this equation is used to determine the query estimation accuracy metric, an example is shown below. Given a randomly generated list of row identifiers, X, and a sensitive attribute S, the following aggregate pseudo-queries are defined:

- SELECT AVG(S) FROM original_table WHERE id IN X

- SELECT AVG(S) FROM anonymized_table WHERE id IN X

Using these queries, R and R' respectively can be calculated, and from them the query estimation error. Repeating this process a number of times, using different rows, gives multiple query estimation error values which can be averaged together. The average value is the information loss metric. To gain more insight into the properties of the anonymized database, the size of X can be varied and queries which consider only part of the sensitive attribute's value range can be experimented with [Zhang et al., 2007, Aggarwal, 2008]. This measure does require manual formulation of queries to execute, which can be seen as a downside.
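A minimal sketch of this procedure on two pandas DataFrames, assuming the sensitive attribute is still numeric in the anonymized table (as is the case for perturbation and permutation) and that both tables share the same index:

```python
import numpy as np

def query_estimation_error(original, anonymized, sensitive, sample_size, runs=100, seed=0):
    """Average relative error (%) of AVG(sensitive) over random row samples."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(runs):
        ids = rng.choice(original.index.to_numpy(), size=sample_size, replace=False)
        r = original.loc[ids, sensitive].mean()          # R on the original table
        r_prime = anonymized.loc[ids, sensitive].mean()  # R' on the anonymized table
        errors.append(abs(r - r_prime) / r * 100)
    return float(np.mean(errors))
```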


CHAPTER 3

Analysis of the IDO-Laad database

In this chapter, the IDO-Laad database is introduced. The tables relevant for the HvA teachers and the data contained within these tables are shown. Using this knowledge, the anonymization requirements can then be formulated. Conversations with the data owners, the municipalities, have taken place. The results of these conversations and the implications for the anonymization requirements will be discussed. It is also important to determine if these requirements are in line with the expectations of the end-users. Thus, HvA teachers have been interviewed as well. This chapter concludes by classifying the attributes contained in the relevant database tables as either an identifier, quasi-identifier, non-identifier attribute or sensitive attribute.

3.1 Data of interest

The IDO-Laad database contains a multitude of tables. One of the most important attributes in the database is the RFID. To be able to charge an electric vehicle, a charging pass is required. Each user is identified by this pass, which is assigned an identifier string, the RFID. There are three tables in the database that provide the best opportunities for research and usage in an educational environment. The first table of interest is the RFID table. It contains limited information about each RFID, such as the user type and the service provider. The user type can, for example, be a normal user, taxi driver or Car2Go user. No personal information such as gender, date of birth or residential address is included in this table, nor anywhere else in the IDO-Laad database. All the attributes available in this table are shown below:

Table 3.1: Overview of the attributes present in the RFID table.

Attribute | Example | Description
RFID | 60DF4D78 | A string-typed identifier for the charging card.
RFID ID | 7892 | A numerical alternative for the RFID value defined above.
usetype | regular, taxi | Indicates the type of user who owns the charging card.
Service Provider | Essent | Company which provided the charging card.
Volt | 230V, 440V | Indicates the connection type assigned to this card.

The locations table is the next table of interest. It contains a record for each charging station for which charging sessions are available in the database. This table contains stations all over the Netherlands. The majority of the attributes in this table contain information about where the charging station is located. This includes the province, city, district, street and postal code. The attributes that are not related to location are summarized in Table 3.2:


Table 3.2: Overview of the important non-location attributes present in the locations table.

Attribute | Example | Description
Location skey | 221 | A numeric identifier for the charging station.
Provider | Nuon | The company which is in charge of the charging station.

The final table of interest is the charging sessions table. Each record in this table contains the data about one charging session. Through the RFID and location skey, a charging session is connected to the two tables discussed above. The table contains information about the time and date the charging session took place, the amount of energy charged and metrics such as connection time and charging time. The most important attributes are summarized in Table 3.3 shown below:

Table 3.3: Overview of the attributes present in the charging sessions table.

Attribute | Example | Description
ChargeSession ID | 7291732 | A numeric identifier for the charging session.
Location skey | 221 | A numeric identifier for the charging station.
RFID | 60DF4D78 | A string-typed identifier for the charging card.
StartConnectionDT | 18-04-2015 11:00 | Date and time the charging session started.
EndConnectionDT | 18-04-2015 12:00 | Date and time the charging session ended.
ConnectionTime | 0:14:23 | Time the car was connected to the charging station.
chargingtime | 0:02:11 | Total time the car was charging.
kWh | 0.86 | The amount of charged energy in kWh.
City | Amsterdam | The city in which the charging station is located.
Provider | Nuon | Company which is in charge of the charging station.
Address | Dorpstraat 1 | Address of the charging station.
PostalCode | 1057EW | Charging station area ZIP code.
District | Zuid | The city district in which the charging station resides.
Longitude | 4.8952 | Longitude of the charging station's position.
Latitude | 52.3702 | Latitude of the charging station's position.

3.2 Anonymization requirements

In order to provide a software application that is able to cater to the teachers' needs, it is first necessary to determine the anonymization constraints. The data that will be published to the teachers should comply with the anonymization requirements that the municipalities could have. In order to determine the anonymization requirements, meetings with representatives of the municipalities have taken place. To compare the compatibility between anonymization requirements and end-user expectations, HvA teachers have been interviewed as well.

In the time span of the graduation project, it was possible to meet with both the municipality of Den Haag and the municipality of Amsterdam. Both municipalities stressed the importance of the RFID attribute, which links together multiple charging sessions of an individual user. Since it might be possible to link the RFID string to a person's identity, this attribute should be anonymized. The possibility of only masking the RFID values, instead of fully anonymizing them, was discussed, and both municipalities indicated that this would suffice. This means that charging sessions can remain linked to their corresponding RFIDs.

The municipality of Amsterdam specified further requirements, such as anonymizing the company that supplied the charging card and the charging volume of each charging session. Anonymizing these attributes is necessary, as analysis of the charging sessions could reveal statistics about the business case of electric vehicles; the amount of energy charged combined with energy prices could reveal the revenue made by energy companies. Anonymization of the other attributes present in the IDO-Laad database is not necessary, according to the municipalities, because the information contained within these attributes is already being publicly released or publicly available, or does not contain data that could lead to identification of any sort.

Finally, the municipality of Amsterdam formulated a pair of requirements that are partly design requirements. It was requested, if possible, to incorporate a system in the application that would notify the municipality when anonymous data was generated. Such a notification would elaborate on what data is released, who receives the data and for what purpose it was generated. Moreover, the degree of anonymization enforced upon the data should be adjusted depending on the research project at hand. If the authenticity of the data is not important, then the degree of anonymization should be higher, and vice versa.

It is important to gain an understanding of the expectations of the HvA teachers, to see if they match the anonymization requirements. Two teachers agreed to talk about their expectations and requirements in terms of database alteration; the latter will be discussed in Section 4.2. In both conversations, the teachers stressed that the database should retain its statistical meaning. According to the teachers, all data was of use, except for the owner of the charging station. Location-related attributes, in combination with the amount of charged energy and the RFID attribute, were explicitly named, as analyzing these attributes provides the most information. Comparing the expectations of the teachers with the anonymization requirements, there fortunately is a significant overlap. Most analysis can still be performed after anonymizing according to the requirements. The fact that the connection between users and their RFIDs can stay intact makes the data usable for most if not all end-users.

3.3 Classifying the data fields

With the gathered anonymization requirements in mind, it is now possible to classify the data fields for each table of interest in the IDO-Laad database. This covers all attribute types featured in the theoretical background: identifiers, quasi-identifiers, non-identifier attributes and sensitive attributes. Based on these classifications, an anonymization algorithm can be applied to the IDO-Laad dataset.

In the IDO-Laad database, two types of identification are present that require protection: the identity of the charging card owner, and inferred information about energy companies, such as their revenue. The former can be ascribed to the RFID attribute, which links to a single charging card and possibly to the individual who owns the charging card. The latter covers the charging card provider, the Service Provider attribute and the charging volume of each session, the kWh attribute.

The RFID attribute can be seen as a quasi-identifier attribute, as it could be linked to other data to identify a person. In case k-anonymization is applied, the attribute would be generalized. If perturbation or permutation were applied, the attribute would be left unchanged. Both approaches are less than ideal, as they either distort the connection between the user and their records, or leave it unanonymized. As proposed during the meetings, the best approach is to mask the attribute, to keep the association intact and prevent releasing the original RFID without anonymization. The charging card provider and charging volume are both quasi-identifier attributes as well. This makes perturbation and permutation less applicable as anonymization approaches, as the values would remain unchanged.
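A minimal sketch of such masking, using a salted hash (the salt value and truncation length are illustrative assumptions, not the thesis implementation):

```python
import hashlib
import pandas as pd

SALT = "replace-with-a-secret-salt"  # assumption: managed as a secret in practice

def mask_rfid(rfid):
    """Deterministic masking: the same card always maps to the same token,
    so sessions stay linked without exposing the original RFID."""
    return hashlib.sha256((SALT + rfid).encode("utf-8")).hexdigest()[:12]

sessions = pd.DataFrame({"RFID": ["60DF4D78", "60DF4D78", "AB12CD34"],
                         "kWh": [0.86, 3.20, 7.50]})
sessions["RFID"] = sessions["RFID"].map(mask_rfid)
print(sessions)  # both rows of the first card share one masked token
```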

The anonymization requirements can now be defined for each table of interest. Starting with the RFID table, it means that RFID and Service Provider attributes require anonymization. The location table requires no anonymization and the charging sessions table needs to be anonymized for the RFID and kWh attributes. All other attributes are classified as non-identifier attributes, and thus require no anonymization. This will be the minimal anonymization strategy.


One could argue this is a limited amount of anonymization, and that it could perhaps be achieved with less complex strategies than those proposed in the theoretical background section of this thesis. However, the software product should feature variable anonymization, as required by the municipality of Amsterdam. Therefore a robust anonymization algorithm needs to be implemented to account for functionality that goes beyond the minimal requirements described above. Additionally, the product should be future-proof. Research into electric vehicles is expected to increase in the future, certainly when considering that the municipality of Amsterdam wants to ban gas-powered vehicles from the city center [Municipality of Amsterdam, 2015]. The creation of anonymization systems like the one proposed in this thesis is also an incentive for the electric vehicle research community to share more personal data. Using this information, more in-depth research could be conducted. With effective anonymization algorithms in place, the privacy of electric vehicle users can still be guaranteed.


CHAPTER 4

Application design

After defining the anonymization requirements for the software application, the next step is to define the application design, and the requirements the design should meet. The target environment in which the software application needs to run is the first design requirement, as the software itself must be compatible with the environment. After discussing the target environment, an important part of this chapter will discuss the functional requirements for database alteration set by the teachers at the HvA, which are the end users. As mentioned previously, meetings with teachers have taken place to discuss these requirements. The chapter concludes by giving an overview of the final application pipeline.

4.1 Target environment

The target environment of the software application is the computational server of the IDO-Laad research project. The server is a secure environment in which data analysis can be performed. Data from the IDO-Laad database can only be obtained from within the computation server environment, and is stored in a Microsoft SQL 2014 database.

The computational server supports two programming languages: R and Python. There is no possibility to execute scripts from the command-line directly. In order to write code and perform analysis, two integrated development environment (IDE) web services are available on the server, one for each language: RStudio and Jupyter. Other than just executing code, Jupyter allows creating interactive documents in which code, visualization and documentation (in Markdown) can co-exist. To prevent data leakage, it is required that the software solution runs completely inside the computational server environment, using the aforementioned IDEs. Even though the IDO-Laad research group uses R as its preferred programming language, the choice was made to develop this software application in Python, using Jupyter. The reason why Python is preferred is mainly prior programming experience. The graduation project is time-constrained, so it is preferred to focus time on doing research and avoid spending time on anything else.

This does, however, have some implications in terms of retrieving data from the IDO-Laad SQL server. At the start of the project, the only way to retrieve data from the SQL server was through an R wrapper. Fortunately, using the rpy2 Python package, a Python wrapper script was created that emulates the functions in the R wrapper. Most of this work was done by Peter Verkade, a fellow student from the UvA who is also conducting his graduation project as part of the IDO-Laad project.
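As an illustration only, bridging an R wrapper from Python with rpy2 could look roughly like the sketch below; the wrapper file name and function are assumptions, not the actual IDO-Laad code:

```python
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri

pandas2ri.activate()  # enable automatic R data.frame <-> pandas conversion

robjects.r('source("db_wrapper.R")')             # load the R wrapper (assumed name)
fetch_table = robjects.globalenv["fetch_table"]  # assumed function in the wrapper
sessions = fetch_table("charging_sessions")      # arrives as a pandas DataFrame
```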

The software application produced should thus be a Python application, provided as a Jupyter script, in which the user can modify parameters and run the code sections to execute the process of anonymization and alteration. This is not exactly the most user-friendly interface, as it involves changing parameters which are embedded in lines of code. Making a fully working user interface such that it is more intuitively usable for the HvA teachers is not within the scope of this project. The Jupyter script is, however, directly usable for my project supervisor. He will gather requirements for the output database and will use the script for the teachers.

4.2 Analysis of functional requirements

When designing an application, it is important to know which functionality should be supported. Anonymization is one functional requirement that required requirement analysis of its own. In this section the meetings with HvA teachers will be discussed, from which the requirements for alteration of the anonymized database can be determined.

During the meetings with the two teachers who agreed to discuss their data alteration requirements, a number of desirable features were named. An important requirement was to be able to generate multiple databases containing a subset of the records from the original database. The records in these subset databases must provide a good representation of the original database. Moreover, how these subset databases are generated should be customizable. The customization requested included considering only charging sessions that took place in a certain range of years, and choosing to include records only from certain districts of the city. This makes using data from the IDO-Laad dataset more convenient, as a small partition of the whole dataset takes up less space. Moreover, creating multiple databases makes it more difficult for students to copy results from other groups or other students, as they will not have the same database.
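A minimal sketch of such subset generation with pandas (function and column names are assumptions based on Table 3.3; the sampling fraction is arbitrary):

```python
import pandas as pd

def make_subset(sessions, years, districts, frac=0.25, seed=1):
    """Filter charging sessions by year range and districts, then sample a
    fraction so that each course receives a different database."""
    mask = (sessions["StartConnectionDT"].dt.year.between(*years)
            & sessions["District"].isin(districts))
    return sessions.loc[mask].sample(frac=frac, random_state=seed)

# Example (assuming StartConnectionDT is a datetime column):
# subset = make_subset(sessions, years=(2013, 2015), districts=["Zuid", "West"])
```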

Given the three tables of interest, it was requested to make sure that only relevant records are present in these tables. This means that the RFID and locations tables should only contain records of charging cards and charging stations that appear in the charging sessions table. Considering that the locations table contains charging stations all over the Netherlands, it is to be expected that anonymizing the IDO-Laad data of Amsterdam only would cut down the number of records significantly. Another requirement was making the application easy to use for the teachers themselves, by providing a robust interface for them to easily configure and generate databases. As mentioned in the previous section, this is not within the scope of this project, and thus this requirement will not be fulfilled.

My project supervisor had a number of feature requests as well. These included permuting attributes of the anonymized database tables, removing certain attributes and generating additional records (which will contain fake data) for the charging sessions table. A restriction on applying these methods is that the k-anonymization requirement should still hold afterwards. This means permuting one of the quasi-identifier attributes is not possible, as it would create very different equivalence classes, which are not guaranteed to be of a size of at least k. Moreover, at least k records should be constructed when creating new charging sessions to ensure k-anonymity is guaranteed.

4.3 Application pipeline

This section features an overview of the final application pipeline. Step by step, a typical iteration of the pipeline is traversed to explain how the system works. The application has been implemented in a modular fashion, and it is thus possible to assemble the pipeline differently if desired. This will be further elaborated upon in the next chapter, implementation details. In Figure 4.1 below, the pipeline is visualized using a flow chart.


Figure 4.1: The final application pipeline, providing an overview of how the application typically executes. The orange states are actions defined within Jupyter, using Python source files saved to disk. The numbers indicate the order of actions taken to anonymize and alter a database.

The user first starts the script, after which all the calculations will take place in the Jupyter environment. Each step is executed using a module, which is called from within the Jupyter environment. Each action will return to the Jupyter environment to save the result of the action.

The first step is retrieving the IDO-Laad database data, using the Python wrapper defined above. The three tables of interest, RFID, locations and charging sessions are retrieved from the Microsoft SQL Server. Then, using user-defined parameters, the data is anonymized. After anonymization, the k-anonymity requirement is validated and feedback about whether it was successful is returned to the user. The exact process of anonymization is discussed in the next chapter. Afterwards, the anonymized database tables are altered to cater to the needs of the HvA teacher. The final step is writing back the resulting database(s) to the Microsoft SQL Server.
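Put together, a typical Jupyter driver cell could look like the sketch below. The module names follow the source files listed in Table 5.1, but the function names and signatures are assumptions for illustration:

```python
import load_dataset
import anonymize
import customize_dataset
import db_client

# 1. Retrieve the three tables of interest from the Microsoft SQL Server.
rfid, locations, sessions = load_dataset.load_idolaad_tables()

# 2-3. Anonymize with user-defined parameters and validate the k-requirement.
anon_sessions = anonymize.anonymize_table(sessions, k=5)
assert anonymize.is_k_anonymous(anon_sessions, k=5)

# 4. Alter the anonymized table to the teacher's wishes.
course_db = customize_dataset.make_subset(anon_sessions, years=(2014, 2015),
                                          districts=["Zuid"])

# 5. Write the resulting database back to the SQL Server.
db_client.write_table(course_db, "course_sessions")
```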


CHAPTER 5

Implementation details

Now that the design and anonymization requirements are defined, the next step is to implement the system as requested. This chapter features an extensive explanation of the anonymization and alteration pipeline implemented for the software product. First, the structure of the finished application is briefly discussed. Subsequently, the anonymization component of the application is thoroughly discussed. The first step is choosing an appropriate anonymization technique and anonymization algorithm, based on the requirements set in the previous chapters. The chosen algorithm is then explained. Furthermore, the implemented functions for altering the anonymized data are discussed.

5.1 General application structure

The resulting software application has been implemented in a modular fashion. A number of modules have been created, which do separate tasks within the whole application pipeline. Because it is modular, certain steps in the pipeline can be re-ordered, if required. These modules are combined into an anonymization application using a Jupyter script, which calls the different modules when necessary. Processing the large amounts of tabular data in Python has been simplified by using the Pandas package. In Section 4.3, the design of the application pipeline has already been discussed. It shows the main modules implemented for this application. In Table 5.1 below, all the module source files are enumerated, along with a description of the functionality contained within the module.

Table 5.1: Overview of the modules implemented for the application.

Source file | Description
load_dataset.py | Module in which a number of different datasets or databases can be loaded for use in the Jupyter script, among which the IDO-Laad database.
anonymize.py | In this module, the anonymization algorithm and the information loss metric are implemented. Passing a database table to the appropriate function returns the anonymized version of the table.
customize_dataset.py | Implementation of the database alteration functions requested by the teachers.
db_client.py | In order to write back results to the Microsoft SQL server, an ODBC client has been created. Basic functionality such as creating and dropping tables has been implemented.
credentials.py | This small module takes care of storing the necessary credentials.
error_module.py | Centralized error message printing, which cleans up the other modules.
The functionality contained within the anonymize and customize_dataset modules will be further discussed in the next two sections.


5.2 Database anonymization

In the theoretical background chapter of this thesis, the different anonymization approaches have been discussed and evaluated. The most suitable anonymization approach can now be chosen by comparing the properties of both approaches to the anonymization and design requirements set in the previous chapters.

When comparing k-anonymization to perturbation and permutation, it can be seen that perturbation and permutation are more focused on not publishing the exact values of sensitive data, while the quasi-identifier attributes remain unchanged. This means that when using these methods, more importance is placed on protecting against disclosure of sensitive data, such as a patient's disease, but they do not protect against identification of the individual or entity of the record. In literature, this is described as membership disclosure: when an entity is identified, all but the sensitive attributes can be immediately linked to that entity. When considering the anonymization requirements, it is not desirable to leave the quasi-identifier attributes unchanged. The municipalities have indicated these attributes must be anonymized properly.

Anonymization of the required attributes is more properly done using k-anonymization, in which the quasi-identifier attributes are the main focus of the anonymization strategy. The quasi-identifier attributes are generalized into ranges or sets of values, after which each combination of quasi-identifier attributes appears in at least k records in the database. A downside is that the representation of the attribute values changes, which can have implications for the process of conducting data analysis using the anonymized database. This disadvantage can be tolerated, however: sufficient anonymization of the IDO-Laad database is the most important requirement, and the comparison of anonymization techniques shows that perturbation and permutation cannot provide it.

Taking into account the advantages and downsides of the anonymization methods discussed above, the choice has been made to incorporate k-anonymization in the implementation of the anonymization algorithm of the project. This means that suppression and generalization will be used to implement k-anonymization. In addition to the discussion above, the fact that these methods and applicable algorithms are more thoroughly documented makes suppression and generalization the most suitable choice.

5.2.1 Choosing an anonymization algorithm

In this section, k-anonymization algorithms present in related research are briefly discussed and evaluated, after which the algorithm of choice is further elaborated upon. This includes an extensive explanation of the algorithm, supported by examples to visualize the anonymization process.

When considering k-anonymization algorithms, it is important to define certain requirements regarding performance and solution quality; most of the time, a trade-off is present between these metrics. Given the IDO-Laad database, performance is important, as the database contains approximately 2 million records, with roughly a dozen attributes present in each record. Moreover, there are a number of properties of algorithms which determine whether an algorithm is suitable for certain use-cases. One of those properties is the type of recoding, which is the transformation of original data to anonymized data. For generalization and suppression anonymization algorithms, there are two recoding models: global recoding and local recoding [Ghinita et al., 2007].

Global recoding maps equal instances of values to the same generalized value, for all records. This approach is simple, and achieves a more consistent anonymized database in terms of the possible values that occur. Local recoding is more flexible, as it allows multiple occurrences of equal value instances to be mapped to different generalized values. Different rules can be applied, with the result that some instances of values are generalized into, for example, a range, whilst other instances do not have to be generalized at all. Comparing these models, the observation can be made that the flexibility of local recoding generally results in less information loss. It does result in heterogeneous attribute values, but minimizing the information loss should be prioritized [Ghinita et al., 2007, Ayala-Rivera et al., 2014]. Moreover, there is also the distinction between single-dimensional and multidimensional recoding. Single-dimensional recoding applies anonymization per attribute, while multidimensional recoding applies anonymization over all quasi-identifier attributes at the same time.

In Figure 5.1 below, the results of applying anonymization using single and multidimensional recoding, in conjunction with global recoding can be seen. The original data records used in the example are again the patients data records used earlier in Figure 2.1a. The difference in equivalence classes created is telling; those created using single-dimensional recoding are far more generalized and thus suffer more information loss.

(a) Single-dimensional recoding. (b) Multidimensional recoding.

Figure 5.1: The patient data records featured in [LeFevre et al., 2006], applying single- and multidimensional recoding. Parameters: k = 2 and quasi-identifier = {age, sex, zip code}.

Furthermore, there is the difference between complete algorithms and greedy algorithms as well. Greedy algorithms cut back on computation time by calculating an approximation of the exact answer for a given input, instead of calculating the exact answer, which generally takes more time. It is inevitable that some of the solution quality is lost when using a greedy algorithm, but this generally does not result in significantly inaccurate output.

Having identified a number of key characteristics of k-anonymization algorithms, it is now possible to compare the different algorithms. Following the discussion in [Ayala-Rivera et al., 2014], three algorithms are compared on execution time, memory usage and a number of information loss metrics. A dataset commonly used in related research, the Adult dataset, alongside a synthetic dataset called the Irish dataset, are used to compare the algorithms. The following algorithms were compared: Incognito, Datafly and Mondrian. A summarized comparison of all three algorithms for all metrics used can be seen in Figure 5.2 below:

Figure 5.2: Comparison of k-anonymization algorithms for the Adult dataset and the synthetic Irish dataset, as featured in [Ayala-Rivera et al., 2014]. Incognito, Datafly and the Mondrian algorithm are compared on execution time, memory usage and information loss.


Incognito is a complete, single-dimensional, global generalization algorithm, which is visible in the execution time results. Both in Figure 5.2 and in the more detailed graphs featured in [Ayala-Rivera et al., 2014], Incognito performs the worst. When using three or more quasi-identifier attributes, the execution time is 10 to 100 times slower than that of the other algorithms. Regarding information loss, one would expect low information loss, given the complete nature of the algorithm. This is not the case, however: in general, it is the worst or second-worst performer in this category as well. Datafly is a greedy, global, single-dimensional generalization algorithm. In terms of execution time, Datafly is a major improvement compared to Incognito. The information loss metrics are at best equally good as Incognito's, and sometimes worse. The Mondrian algorithm performs the best over all metrics: the least amount of information loss combined with fast execution time. This can be attributed to the fact that it is a greedy algorithm, and that it applies multidimensional recoding.

Performance of a k-anonymization algorithm is an important characteristic when applying it to the IDO-Laad database. Moreover, information loss should be kept to a minimum, as it must be possible to conduct data analysis on the anonymized database. Still being able to uncover associations and gather useful information from the anonymized database is an important requirement of the end-users. The Mondrian k-anonymization algorithm fulfills these requirements best. In the next section, the Mondrian algorithm is explained thoroughly.

5.2.2 Mondrian k-anonymity

The Mondrian k-anonymization algorithm was first introduced by LeFevre et al. in 2006. It is a recursive, greedy, multidimensional partitioning algorithm, which achieves k-anonymity over the quasi-identifier attributes of a database table. It partitions the domain of the quasi-identifier attribute values into sections containing k or more records: the equivalence classes mentioned in the theoretical background. The algorithm is best suited to numerical attributes, but can be extended to support categorical values, as discussed in the next section. The algorithm starts with the whole database table as a single partition, which is cut into two smaller partitions; these are recursively cut in two again and again, until making a further cut would violate the k-anonymity principle. When that is the case, the pre-cut partition becomes an equivalence class of the anonymized database. Each equivalence class is generalized using the range or set of quasi-identifier attribute values occurring in the equivalence class itself [LeFevre et al., 2006, Ayala-Rivera et al., 2014].

The Mondrian algorithm can be applied using two types of partitioning: strict and relaxed partitioning. Strict partitioning cuts the database in non-overlapping partitions, using global recoding. Relaxed partitioning is more flexible, as it uses a local recoding scheme, and produces partitions that possibly have an overlap in the generalized quasi-identifier values. Algorithm 1 shows the Mondrian k-anonymization algorithm, using strict partitioning [LeFevre et al., 2006].

Algorithm 1 Mondrian k-anonymization; strict multidimensional partitioning

1:  procedure Anonymize(partition)
2:      if no allowable multidimensional cut for partition then
3:          return φ : partition → summary
4:      else
5:          dim ← choose_dimension()
6:          fs ← frequency_set(partition, dim)
7:          splitVal ← find_median(fs)
8:          lhs ← {t ∈ partition : t.dim ≤ splitVal}
9:          rhs ← {t ∈ partition : t.dim > splitVal}
10:         return Anonymize(rhs) ∪ Anonymize(lhs)
11:     end if
12: end procedure


Using the Mondrian algorithm shown in pseudo-code above, the partitioning process can be explained, first for strict partitioning. The algorithm starts with the whole database table as partition. The first step is to determine whether an allowable multidimensional cut can be made for the current partition. Cutting the partition in two is always done based on the domain of one of the quasi-identifier attributes. To determine whether an allowable cut is possible, the cut is attempted for every quasi-identifier. A cut is allowable or valid if the two resulting partitions both meet the k-anonymity requirement [LeFevre et al., 2006]. If one or more of the quasi-identifiers produce a valid cut, an allowable cut is evidently possible. If this is not the case, the partition is generalized and forms an equivalence class in the anonymized database table.

In the “else” case, the process of applying a cut is described. Starting from line 5 of the algorithm, first the quasi-identifier that will be used to cut the partition, dim, is determined. One of the quasi-identifiers that produced a valid cut is chosen, using the attribute’s normalized domain width as heuristic. The normalized domain width is equal to the NCP measure discussed in Section 2.3.1. The quasi-identifier attribute with the largest normalized domain width is used to cut the partition. On lines 6 and 7 of the algorithm, the median value of the selected attribute’s domain within the partition, splitVal, is determined. This value is then used to divide the records in the partition into two smaller partitions, lhs and rhs, as shown on lines 8 and 9. Finally, the recursive call executes the same process for the newly created partitions until the k-anonymity requirement would be violated. A sketch of this process is given below.
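To make the control flow concrete, the following is a minimal, self-contained Python sketch of strict Mondrian partitioning for numerical quasi-identifiers. It illustrates the algorithm as described above, not the actual thesis implementation; the record structure and helper names (normalized_width, generalize) are assumptions made for this example.

```python
def normalized_width(partition, attr, global_ranges):
    # NCP-style heuristic: the attribute's local range divided by its
    # global range over the whole table.
    lo = min(r[attr] for r in partition)
    hi = max(r[attr] for r in partition)
    g_lo, g_hi = global_ranges[attr]
    return (hi - lo) / (g_hi - g_lo) if g_hi > g_lo else 0.0

def generalize(partition, qi_attrs):
    # Replace each quasi-identifier value by the partition's (min, max) range.
    ranges = {a: (min(r[a] for r in partition), max(r[a] for r in partition))
              for a in qi_attrs}
    return [{**r, **{a: ranges[a] for a in qi_attrs}} for r in partition]

def anonymize(partition, qi_attrs, k, global_ranges, result):
    # Try attributes in order of decreasing normalized width, breaking ties
    # alphabetically; the first attribute whose median split leaves at least
    # k records on both sides provides the cut.
    dims = sorted(qi_attrs,
                  key=lambda a: (-normalized_width(partition, a, global_ranges), a))
    for dim in dims:
        values = sorted(r[dim] for r in partition)
        split_val = values[len(values) // 2]
        lhs = [r for r in partition if r[dim] <= split_val]
        rhs = [r for r in partition if r[dim] > split_val]
        if len(lhs) >= k and len(rhs) >= k:
            anonymize(lhs, qi_attrs, k, global_ranges, result)
            anonymize(rhs, qi_attrs, k, global_ranges, result)
            return
    # No allowable cut remains: the partition becomes an equivalence class.
    result.extend(generalize(partition, qi_attrs))

# Toy example in the spirit of the patient records:
records = [{"age": 25, "zip": 53710}, {"age": 25, "zip": 53712},
           {"age": 26, "zip": 53711}, {"age": 27, "zip": 53710},
           {"age": 27, "zip": 53712}, {"age": 28, "zip": 53711}]
qi = ["age", "zip"]
bounds = {a: (min(r[a] for r in records), max(r[a] for r in records)) for a in qi}
classes = []
anonymize(records, qi, 2, bounds, classes)
```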

When applying relaxed partitioning, the algorithm changes slightly, and some simplifications can be introduced to speed up the anonymization process. Instead of putting all records whose dim value equals splitVal in lhs, they are now divided evenly over the two partitions. This means that lhs and rhs are of equal size, or that one partition contains one record more. As a result, the condition that decides whether an allowable multidimensional cut is possible for the current partition is significantly simplified: if the current partition contains more than 2k − 1 records, a cut is possible, by definition, for all quasi-identifier attributes. Cutting a partition with 2k records results in two partitions of k records, which satisfies the k-anonymity requirement. A corresponding sketch of the relaxed split follows.
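Under relaxed partitioning, the earlier sketch simplifies considerably. A hypothetical version of the split could look as follows; records equal to the median are divided evenly because the sorted list is simply cut at its midpoint.

```python
def relaxed_split(partition, dim, k):
    # With relaxed partitioning a cut is allowable, by definition, whenever
    # the partition holds at least 2k records (i.e. more than 2k - 1).
    if len(partition) < 2 * k:
        return None
    ordered = sorted(partition, key=lambda r: r[dim])
    mid = len(ordered) // 2  # resulting sizes differ by at most one record
    return ordered[:mid], ordered[mid:]
```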

A visualization of the strict partitioning process can be seen in Figure 5.3 below. Again the patient data records are used. Each point in the partitioning space is a record from the original database table. The source of the figures is [LeFevre et al., 2006]:

(a) Original patient data records. (b) Partitioned patient data records.

Figure 5.3: The patient data records featured in [LeFevre et al., 2006], in which each record is a dot in the images. Mondrian k-anonymization with strict partitioning is applied. Parameters: k = 2 and quasi-identifier = {age, zip code}.

In the example above, the first cut is made using the zip code attribute. This choice is ambiguous, however: the first cut in the Mondrian algorithm is always ambiguous if multiple quasi-identifier attributes can produce a valid cut. In the example, both age and zip code can produce an allowable cut, and comparing the normalized domain widths of both attributes shows that both values are equal to 1.0. Thus, no unambiguous choice can be made; how to proceed from here is implementation-dependent. For the application discussed in this thesis, the quasi-identifier attribute names are sorted alphabetically if the values are equal, so the age attribute would have been chosen, as illustrated in the snippet below.
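This deterministic choice can be expressed with a compound sort key, reusing the hypothetical normalized_width helper from the earlier sketch: equal widths fall back to alphabetical attribute order.

```python
def choose_dimension(valid_dims, partition, bounds):
    # Largest normalized width first; ties fall back to alphabetical order.
    return min(valid_dims,
               key=lambda a: (-normalized_width(partition, a, bounds), a))
```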

Continuing the example: the median zip code value is calculated and the cut is applied, which gives the vertical line visible in Figure 5.3b. The next step is partitioning the newly created partitions. The right-hand side contains only two records, so no allowable cut is possible. The left-hand side does allow a cut, on the age attribute. The median of the age attribute is calculated and the resulting cut is represented by the horizontal line. Neither of the two newly created partitions can be cut further.

5.2.3 Extending Mondrian k-anonymity for categorical attributes

In order to extend the Mondrian algorithm to categorical attributes, the most commonly used method is creating Domain Generalization Hierarchies, as discussed in Section 2.3.1. The downside is that these trees have to be created manually, which becomes a burden when the contents of databases change or when new tables are added that require new DGHs. Thus, the choice has been made to implement NDGH to support categorical attributes, which generalizes a categorical value into the set of unique categorical values present in the equivalence class [Nergiz et al., 2010].
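In code, NDGH-style generalization is trivial compared to maintaining hierarchies. A minimal sketch, assuming records are dicts as in the earlier examples:

```python
def ndgh_generalize(partition, attr):
    # A categorical quasi-identifier generalizes to the set of distinct
    # values that occur in the equivalence class, e.g. {"Audi", "BMW"}.
    return {r[attr] for r in partition}
```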

To support categorical attributes, the implementation of the algorithm has to be extended or modified in a number of places. The categorical NCP calculation has been implemented to determine the correct attribute to cut with and to calculate information loss. The partitioning process for categorical attributes mimics the behavior for numerical attributes, which uses the median value of the sequence of numbers to cut the partition in two. The most closely related operation for categorical attributes is to order the categorical values present in the partition alphabetically and choose the central value. This median categorical value is then used as splitVal. Support is added for both strict and relaxed partitioning.

For relaxed partitioning, the process of cutting the partition stays the same, implementation-wise. The values of the attribute selected to perform the cut are ordered alphabetically, which can be done using the same function call in Python. The median (categorical) value is selected, after which the records can be divided evenly over two new partitions. Strict partitioning requires a little more effort. The median is selected in the same manner; at that point, all values equal to and smaller than the median have to be selected. For categorical attributes, the equivalent operation is selecting all values that are equal to the median and those with a lower index (in the partition) than the index of the first occurrence of the median value. The values with an index higher than the index of the last occurrence of the median value make up the other partition. To clarify the partitioning process, an example has been visualized for both partitioning types in Figure 5.4.

Figure 5.4: Visualization of the partitioning process for both strict and relaxed partitioning. The t-values are the attribute’s values in the partition, alphabetically sorted. Red and green denote the new partitions, after applying the cut.


The example shows the categorical attribute’s values in the partition after alphabetic sorting has already taken place. The median value is t3, which means that for relaxed partitioning, the red and green values form the two new partitions. In the case of strict partitioning, all values are allocated to one partition. This example shows the drawback of strict partitioning with categorical values: when the attribute’s values in the database are distributed very non-uniformly, there is a high chance the median equals the most frequently occurring value. The result is that one partition is very large and the other non-existent or very small, which often causes the k-anonymity requirement to be violated [Ayala-Rivera et al., 2014].
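A minimal sketch of both categorical split variants, assuming the chosen attribute’s values are plain strings; the helper name is illustrative, not the thesis implementation. Running the strict variant on a skewed value distribution reproduces the drawback described above.

```python
def categorical_split(values, strict):
    ordered = sorted(values)                 # alphabetical ordering
    mid = len(ordered) // 2
    if not strict:
        # Relaxed: simply cut the ordered list in half.
        return ordered[:mid], ordered[mid:]
    # Strict: the median value and everything up to its last occurrence go
    # into one partition; the remainder forms the other partition.
    median = ordered[mid]
    last = len(ordered) - 1 - ordered[::-1].index(median)
    return ordered[:last + 1], ordered[last + 1:]

# A skewed distribution: strict partitioning yields a 4-vs-1 split, which
# would violate the k-anonymity requirement for k = 2.
print(categorical_split(["bmw", "bmw", "bmw", "audi", "fiat"], strict=True))
```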

5.3 Database alteration functionality

In Section 4.2, the functional requirements for the software application were discussed. Most of these requirements were implemented in the final application. This section discusses the implementation of the requirements, and explains why certain requirements were not fulfilled.

The ability to create multiple subset databases out of the original database has been implemented, including the requested customization. Parameters can be passed to the function to select only records from a given range of years and a select set of districts. In order to select the correct number of records from each included district, the distribution of districts over the original database table is calculated. A drawback of this strategy is that when generating a database table with n records, for relatively small n, the resulting table can contain fewer than n records, because the number of records to select per district is always rounded to an integer; the selection logic is sketched below. Moreover, functionality to delete and permute attributes of the anonymized database was added, with the restriction that quasi-identifiers cannot be permuted. Perturbation-based generation of records (which contain fake data) is supported as well.
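A sketch of the proportional selection, with hypothetical column names. Truncating each district’s share to an integer is an assumption here (the thesis only states the number is rounded), but it shows why the total can come out below n.

```python
from collections import Counter

def district_sample_sizes(sessions, districts, n):
    # Distribution of the included districts over the original table.
    included = [s for s in sessions if s["district"] in districts]
    if not included:
        return {d: 0 for d in districts}
    freq = Counter(s["district"] for s in included)
    total = len(included)
    # Truncating each share to an integer makes the sizes sum to at most n,
    # which explains the drawback noted above.
    return {d: int(n * freq[d] / total) for d in districts}
```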

Before anonymizing the charging sessions database table, the RFID and locations tables are altered to include only RFIDs and charging stations that appear in the charging sessions table. This has to be done before anonymizing, to take advantage of the deleted rows during anonymization. During this process, the RFID attribute can be masked as well: the unique RFIDs are mapped to a simple integer representation, from 0 to n − 1 for n unique RFIDs. This mapping can then be used to select the correct records and mask the RFID attribute. It is created using the charging sessions table and applied to the RFID table. In case the original RFID value needs to be retrieved from an anonymized record, the RFID ID in the original RFID table or the ChargeSession ID in the original charging sessions table can be used.
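A sketch of the masking step, with hypothetical table and column names: the mapping is built from the charging sessions table and then applied to the RFID table, dropping rows whose RFID does not occur in any session.

```python
def build_rfid_mapping(sessions):
    # First-seen order: each unique RFID gets the next integer, 0..n-1.
    mapping = {}
    for s in sessions:
        mapping.setdefault(s["rfid"], len(mapping))
    return mapping

def mask_rfids(rows, mapping):
    # Keep only rows whose RFID occurs in the sessions, masking as we go.
    return [{**row, "rfid": mapping[row["rfid"]]}
            for row in rows if row["rfid"] in mapping]
```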

One requirement that has not been implemented is the request from the municipality of Amsterdam to include a notification system, which notifies the municipality when anonymized data is released. This requirement was partly an anonymization requirement and partly a functional requirement, as discussed in Section 3.2. Due to time constraints it was not possible to include this requirement in the scope of the project.

The implemented functionality can also be used modularly. For example, attributes can be deleted from the anonymized database, but the same can be done for the generated subset databases. The current implementation of the Jupyter script sets up a more standardized pipeline, so that users do not have to modify code heavily (such as adding function calls), but in principle the functionality can be combined freely.


CHAPTER 6

Experiments

In this chapter, the results of the conducted experiments are shown. The experiments focus on testing the performance and quality of the anonymization process of the implemented software application. The alteration component of the application is not included in the experiments, as the related functionality is not computationally complex and thus not relevant here. The chapter concludes by showing records from the IDO-Laad database, before and after applying anonymization.

6.1 Measuring anonymization performance and quality

In order to get a general impression of the Mondrian k-anonymization algorithm’s performance, tests have been conducted using the Adult dataset, which was mentioned in Section 5.2.1. The dataset contains 32561 records in total, with 15 attributes per record. For the meaning of all attributes in the Adult dataset, please refer to the online documentation. Afterwards, the IDO-Laad database is tested as well. Both strict and relaxed partitioning are used in the experiments, where applicable, and all results are averages of three measurements. In the figures shown, anonymization time denotes the time necessary to execute the anonymization process.
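The measurement setup can be summarized with a small harness like the following; this is a hypothetical reconstruction of the methodology (three runs, averaged), not code from the thesis.

```python
import time

def mean_runtime(run_anonymization, runs=3):
    # Execute the anonymization once per run and average the wall-clock time.
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        run_anonymization()
        durations.append(time.perf_counter() - start)
    return sum(durations) / runs
```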

In Figure 6.1 below, the anonymization time and information loss (GCP) are measured using a number of different values of anonymization parameter k, in combination with a numerical quasi-identifier: {age, fnlwgt}.

(a) Execution time for different values of k. (b) Information loss (GCP) for different values of k.

Figure 6.1: Mondrian k-anonymization of the Adult dataset, using different values of parameter k and a numerical quasi-identifier: {age, fnlwgt}.
