
Differential Privacy and Model Extraction Attacks on Deep Neural Networks


Analysing the effects of Differential Privacy on the performance of Deep Neural Networks, as well as the implications for Model Extraction Attacks.

Serge A. van Haag (11773324)
Bachelor thesis, Credits: 18 EC
Bachelor Kunstmatige Intelligentie (Artificial Intelligence)
University of Amsterdam, Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor: S. Amiri (MSc)
Institute for Logic, Language and Computation, Faculty of Science
University of Amsterdam, Science Park 907, 1098 XG Amsterdam


contents

1 Introduction
2 Related Work
   2.1 Deep Learning
   2.2 Differential Privacy
   2.3 Model Extraction Attacks
3 Approach and Methodology
   3.1 Approach
   3.2 Methodology
4 Experiments and Results
   4.1 MNIST
   4.2 Chest x-rays
5 Discussion
6 Conclusion
7 Future work
   7.1 Chest x-ray images
   7.2 MNIST
   7.3 Methodology


list of figures

Figure 1   General model extraction attack flowchart
Figure 2   Adversarial example created to trick MNIST classifiers
Figure 3   MNIST: Relationship between ε and accuracy
Figure 4   MNIST: Accuracy without Differential Privacy
Figure 5   MNIST: Performance on MNIST query set
Figure 6   MNIST: Model extraction by using the MNIST query set
Figure 7   MNIST: Adaptive sampling on the MNIST query set
Figure 8   MNIST: Performance on EMNIST query set
Figure 9   MNIST: Model extraction by using the EMNIST query set
Figure 10  MNIST: Adaptive sampling on the EMNIST query set
Figure 11  MNIST: Performance on FashionMNIST query set
Figure 12  MNIST queried with the FashionMNIST query set
Figure 13  MNIST: Model extraction by using the FashionMNIST query set
Figure 14  MNIST: Adaptive sampling on the FashionMNIST query set
Figure 15  MNIST: Performance on FakeData query set
Figure 16  MNIST: digits can resemble the eight
Figure 17  MNIST: Model extraction by using the FakeData query set
Figure 18  COVIDx: Complexities of class imbalance
Figure 19  COVIDx: Undersampling variants
Figure 20  COVIDx: Non-DP classifier performance
Figure 21  COVIDx: DP classification results
Figure 22  PNEUMONIAx: Non-DP classification results
Figure 23  PNEUMONIAx: DP classification results
Figure 24  PNEUMONIAx: DP classification results in detail
Figure 25  PNEUMONIAx: Reduction
Figure 26  PNEUMONIAx: Non-DP classification results on reduced training data
Figure 27  PNEUMONIAx: DP classification results in detail
Figure 28  Comparing datasets
Figure 29  MNIST: Convolutional layers
Figure 30  COVIDx database class proportions
Figure 31  MNIST: Adaptive sampling on the FakeData query set
Figure 32  COVIDx: Performance of COVID-Next on an independent test set
Figure 33  PNEUMONIAx: Results of a noisy DP model over the course of 300 epochs
Figure 34  PNEUMONIAx: DP classification results averaged
Figure 35  MNIST database class distribution
Figure 36  MNIST: Performance on FakeData query set

list of tables

Table 1   Query effects and differences on data
Table 2   Example database X
Table 3   Posterior belief on Kq_age with different values for ε
Table 4   Posterior belief on Kq_image with different values for ε
Table 5   Example predictions P
Table 6   Samples per label in COVIDx database
Table 7   Hyperparameters of DCNN trained on MNIST
Table 8   MNIST: Victim model performance
Table 10  Hyperparameters of ResNet trained on chest x-rays
Table 11  Samples per label in PNEUMONIAx database
Table 12  COVIDx database


acknowledgements

First of all, I would like to thank my supervisor Mr. S. Amiri for providing valuable guidance and attentive feedback. I found that he is very enthusiastic and has a great amount of interest and knowledge in the field of AI, which inspires me.

Furthermore, I would like to thank my family for supporting me during this period, as well as the many friends who helped me.

Additionally, in these times of the COVID-19 pandemic, I would like to thank the many people who positively contribute to reducing the impact of the ongoing pandemic in the broadest sense. This also includes the people who help keep study spaces, such as libraries and universities, safely open. I have studied at the University of Amsterdam every day over the course of two months, and this was key in finalising the work that had to be done. These facilitating services meant a great deal to me.

Last but not least, I would like to give a special thanks to my fellow student and more than anything else, good friend, F. van Leeuwen for proofreading this thesis multiple times.

abstract

The primary aim of this research is two-fold. First, the general effects of data differential privacy (DP) in Deep Neural Networks (DNNs) are explored in relation to the potential for model extraction attacks. Secondly, it is explored whether DP can ensure data privacy, as well as data utility, in practical situations that arise when a DNN is trained on high-dimensional data.

To answer these questions, multiple experiments were conducted on multiple DNNs that satisfied different levels of DP.

The results do not indicate that data DP also ensures model DP. Furthermore, our practical experiments reveal an extreme privacy-utility trade-off; a trade-off in which reasonable data DP could not be guaranteed unless almost all utility of the DNN was lost.


glossary

In this paper a lot of technical jargon is used. The terms described below are used consistently throughout the paper to prevent potential confusion. This is by no means an all-encompassing glossary; only the terms that were deemed most relevant are included.

adversarial examples Samples that are designed to cause a classification model to make an incorrect prediction.

adaptive sampling A querying strategy in which the attacker queries the victim model selectively, and the attacker adapts the selection criteria on the fly. This strategy can be adopted to discover a subset of queries that are more informative than when random sampling is used.

attacker The attacker has malicious intentions and aims to attack the victim model by performing model extraction.

black-box model A black-box model provides only query access to its users.

confidence values Confidence values are the set of prediction probabilities for each class that a model returns upon being queried with an arbitrary input (Fredrikson, Jha, and Ristenpart 2015, p. 1322).

convolutional neural network A class of DNNs that are commonly used in computer vision to analyse visual images.

deep learning Deep learning refers to the family of Machine Learning (ML) models, based on neural networks, that can learn representations autonomously.

differential privacy Differential privacy (DP) is a mathematical definition of privacy that allows public sharing of information about a dataset while guaranteeing the individual privacy of the data points in the dataset to an arbitrary degree. In this paper, DP was implemented by using the moments accountant method to achieve (ε, δ)-DP.

loss function This is commonly called the objective function or cost function too. This function calculates the loss between the output of a model and the desired output (as specified by the labels in supervised learning). It therefore determines the error of the model.

model In this paper all models used are machine learning models, specifically deep neural networks (DNNs).

model extraction attack A model extraction attack is an attack in which the attacker obtains a new model, whose performance is equivalent to that of the victim model, via query access to the victim model (Tramèr et al. 2016, p. 601).

optimiser The optimiser updates the parameters of a model in response to the output of the loss function. It aims to minimise the loss function.

prediction The model makes a prediction based on the confidence values. Explicitly, it predicts the label that is associated with the highest confidence value.

query A model can be queried with inputs to predict an output.

query access Query access is a limited type of access to a model, in which the person with query access can only supply the model with inputs and read the outputs that the model returns, without having knowledge of the internal workings of the model.


query budget The query budget defines the maximum number of queries that an attacker may use to query the victim model.

query set A query set is a set of queries that can be used to query a model.

random sampling A querying strategy adopted by the attacker to query the victim model with random samples that originate from the query set.

supervised learning The task of learning a function that maps an input to an output based on example input-output pairs. These input-output pairs are feature-label pairs.

transfer set A transfer set is a set that the attacker constructs by querying a victim model with a query set and saving the query and victim model output pairs. The victim model output can vary under different settings. Some models provide users with confidence values and the associated prediction, while other more restricted models provide the user only with the prediction.

victim The victim is the creator of a model. He becomes a victim, rather than a creator, when an attacker has gained query access to this model.

white-box model A white-box model allows more access than a black-box model, as it allows the user to analyse the internal workings of the model.

1 introduction

In recent years, the use of machine learning (ML) models has increased rapidly. ML has been shown to be beneficial in numerous ways (Wuest et al. 2016). Simultaneously, the volume of gathered data has increased enormously (Chen, Mao, and Liu 2014). As a consequence, the interest in applying ML to large volumes of data has grown, and ML is considered indispensable to turn big data into real value (Zhou et al. 2017, p. 358). New ML models are constantly being created and published.

A published model generally takes the form of either a white-box model or a black-box model (Oh, Schiele, and Fritz 2019). A model is called a black-box model if it takes inputs and returns outputs while the inner workings of the model remain undisclosed; it can be described in terms of input-output pairs, without any knowledge of its internal workings. In contrast, if a model is white-box, then the internal workings of the model are obtainable.

Several companies train and publish their ML models to provide access to others for monetary gain. This is commonly referred to as Machine Learning as a Service (MLaaS) (Ribeiro, Grolinger, and Capretz 2015). These companies invest resources to train a model and profit from third-party developers that want to query their MLaaS model to obtain predictions. For these companies, it is important that their ML models are queried as much as possible, and it is understandable that they aim to keep the internal workings concealed. In the healthcare industry, models have been published as well, and this is expected to increase in the future (Jiang et al. 2017, p. 241). For example, computer vision may help future physicians to make fast and accurate diagnoses, while natural language processing allows care providers to dictate notes.

However, publishing trained models does not come without risks. In the healthcare sector, privacy risks have been considered to be among the most critical risks (Annas 2003, p. 1490). Medical ML models are often trained on privacy-sensitive training data, such as x-ray images or genomic data. The GDPR and the HIPAA are examples of efforts that have been made to ensure data privacy. Nonetheless, research shows that it is possible to make accurate predictions about the (possibly sensitive) data that a model was trained on, just by querying a black-box ML model (Shokri et al. 2017) (Fredrikson, Jha, and Ristenpart 2015). An attacker can also try to reverse engineer an original ML model to create a replica. Such attacks are named model extraction attacks; the purpose is to capture the complete workings of the model (Tramèr et al. 2016, p. 601). Research has shown that reverse engineering a black-box model into a white-box model leads to increased privacy risks because it enlarges the attack surface available to the attacker: the white-box model can provide the attacker with more insight into the workings of the model. In this paper, the attention was focused on data privacy. However, in addition to privacy risks, risks can also be economic in nature. If one successfully manages to extract the model, then the dependence on the paid ML service is terminated because one possesses a functional equivalent. The attacker can then query this model for free and possibly distribute it, increasing the economic burden on the original model provider. Furthermore, the attacker can also inspect the inner workings of the extracted model.

As a model extraction attack essentially reverse engineers a black-box model into a white-box model, the extracted model can serve as a proxy for the original model on which the attacker can work stealthily. Therefore, defending against model extraction attacks is of key importance to protect data privacy and, logically, to ensure model privacy. Privacy-Preserving Machine Learning (PPML) is the umbrella term used to describe ML methods that prioritise data privacy. Roughly, PPML can be divided into methods that deal with model privacy and ones that deal with data privacy. Data privacy can practically be achieved in three ways: through anonymisation, cryptography and obfuscation. It has been demonstrated that anonymisation cannot always guarantee data privacy (Narayanan and Shmatikov 2006) (Rocher, Hendrickx, and De Montjoye 2019). Cryptography, on the other hand, can be used to protect data against outsiders without access, but it falls short in guaranteeing privacy when the encryption mechanism is broken or when a party with access infringes on privacy. In contrast, obfuscation-based mechanisms can ensure data privacy, but this comes at a price; one that is paid by the utility of the data. This is known as the privacy-utility trade-off.

Differential privacy (DP) is an obfuscation technique that defines privacy mathematically. This technique can be used to share general statistics about a population (in a dataset) while the mechanism withholds information about individuals of this population. DP is generally achieved by adding noise to a greater or lesser extent. The privacy-utility trade-off makes no exception for DP; the privacy increases when the noise increases, but as a consequence, the utility of the data generally decreases. Therefore, differential privacy is not fixed, but rather a gradual measure that can be used to quantify the extent to which the privacy of individuals in a population is ensured.

This research is part of a broader study that addresses data privacy in ML models that are utilised in the healthcare sector. ML models, and specifically Deep Neural Networks (DNNs), are trained with different levels of DP added to them. The purpose of this paper is to give both theoretical and practical insight into the construction of DP models and the consequences for model extraction attacks. This is attempted by simulating and analysing the interaction between a defending victim, who published a DNN, and an attacker who aims to infringe the privacy of a model by attempting to extract this DNN. The broader study is designed to conduct follow-up studies with the extracted models to perform two qualitatively different attacks, named membership inference and model inversion attacks (Shokri et al. 2017) (Fredrikson, Jha, and Ristenpart 2015). The former attack intends to reveal whether or not some data point was included in the training data of a model, while the latter aims to reconstruct original training data. Both pose a direct threat to the privacy of the data that was used during the training of a model. In the grand scheme of things, it is considered whether this research, which focuses mainly on model extraction attacks on models trained with DP, offers the attacker an advantage in conducting model inversion and membership inference attacks. This research primarily aims to address two questions: (1) what are the general effects of DP, when added to DNNs to ensure data privacy, on the potential for model extraction attacks; (2) can DP ensure data privacy, while still providing data utility, in practical situations that arise when a DNN is trained on high dimensional data?

This paper can roughly be divided into two parts. The first part is dedicated to exploring the effects of DP on DNNs, as well as to investigating the capabilities of an attacker to steal such a private model. These experiments have been conducted on a general and relatively low dimensional dataset of handwritten digits: the MNIST dataset. This decision was made to analyse the universal and fundamental effects that DP has on (DNN) model extraction attacks, without being distracted by superfluous incidents. In this respect, it can be regarded as a good candidate to answer the first question. On the other hand, it might be insufficient and superficial because it lacks the depth of certain state-of-the-art ML models that are trained on complex data. Consequently, this experiment is unsuited to answer the second question. Therefore, a second part was added to shed light on the practical side of adding differential privacy to a realistic and topical problem. Various experiments have been conducted on relatively high dimensional data in the form of chest x-ray images that belong to patients that have been diagnosed with COVID-19, pneumonia, or neither. This offers insight into questions such as: how complex is it to add DP to high-dimensional data, and what are the implications of the privacy-utility trade-off?


In advance of these experiments, efforts have been made to supply the reader with a background in deep learning and differential privacy. Thereupon, the paper provides an overview of and elaborates upon state-of-the-art research in the area of model extraction attacks. Subsequently, the approach and methodology used in this research are described. The decision has been made to provide rich background information so that the context and the scope of the experiments become clear. Finally, the results are discussed and the conclusion is drawn.

2 related work

2.1 Deep Learning

In this section, a few basic concepts of Deep Learning (DL) are explained. This is by no means an all-embracing description of DL, as many aspects are omitted. It serves only as a brief introduction to concepts that are frequently used in this paper, such as Stochastic Gradient Descent (SGD) and Convolutional Neural Networks (CNNs).

Conventional ML techniques were limited in their ability to process natural data in raw form. Implementing ML techniques required domain expertise and careful engineering to transform the raw data into a suitable internal representation from which an ML model could learn (LeCun, Bengio, and Hinton 2015, p. 436).

Representation learning methods allow machines to "automatically" learn suitable representations from raw data; deep learning is such a method (LeCun, Bengio, and Hinton 2015, p. 436). It is named deep learning because multiple layers are used in the network. Each of these layers transforms the representation of the previous layer (starting with the raw input) into a more abstract, higher-level representation. In this process, complex functions can be learned by composing enough transformations (LeCun, Bengio, and Hinton 2015, p. 436).

In machine learning, the task of mapping an input to an output based on example input-output pairs is named supervised learning. An ML model learns this mapping during the training phase as it aims to infer a function that maps the inputs to the outputs. Thereafter, if the model has learned well, it can map novel inputs to the correct outputs.

To understand this paper completely, it is important to roughly understand how ML models learn from input-output pairs. This can best be understood by using an example. Imagine that a hospital has x-ray images of patients and a doctor determined whether the patients are sick or not. The doctor thus classifies each patient, and we speak of different patient classes, corresponding to the different judgements of the doctor. The images can be used as inputs, while the doctor's expertise provides the corresponding outputs. During training, we feed the ML model the inputs and let it predict output in the form of confidence values. These confidence values are the prediction probabilities for each class that the model predicts. We then compute a loss function that measures the error between the output of the model and the desired output determined by the doctor. The model learns from this loss by adjusting its weights to reduce this error. As there can be hundreds of millions of weights in a model, it needs to adjust the weights effectively. Stochastic gradient descent (SGD) approaches this by taking the gradient of the loss function with respect to each weight, because gradients indicate the direction of the greatest increase (LeCun, Bengio, and Hinton 2015, p. 436). The gradient of the loss function therefore tells us how we should change the weights so that the error increases the most (it is important to keep this property of gradients in mind, because in section 2.3.2 the gradients are used to increase the error on purpose). This is the opposite of what we want to achieve, so we adjust the weights in the negative direction of the gradient vector to decrease the error the most, which explains why it is called gradient descent. Stochastic, in turn, refers to the strategy that is adopted to update the weights: the network weights are updated for each individual input-output pair, so the path that the model follows to descend the gradients can be somewhat capricious. This contrasts with batch gradient descent, which computes the gradient over a larger batch of samples and then updates the network. It must be noted that training occurs over the course of a variable number of ‘epochs’, indicating the number of passes of the entire training dataset through the model. Finally, besides stochastic and batch gradient descent, there are plenty of other methods that can be used to adjust the weights of a network to minimise the loss function. These methods are named optimisers.
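To make this concrete, the following minimal sketch (plain Python/NumPy, illustrative only and not the code used in this thesis) trains a toy logistic-regression "network" with stochastic gradient descent; all names and hyperparameters are hypothetical.

```python
import numpy as np

# Minimal SGD sketch on a toy logistic-regression "network" (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))                                  # inputs (features)
y = (X[:, 0] + 0.1 * rng.normal(size=256) > 0).astype(float)    # desired outputs (labels)

w, b, lr = np.zeros(10), 0.0, 0.1                               # weights and learning rate

def forward(x):
    # confidence value for the positive class
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

for epoch in range(5):                        # one pass over the data = one epoch
    for i in rng.permutation(len(X)):         # "stochastic": update per sample
        p = forward(X[i])
        # gradient of the cross-entropy loss w.r.t. the weights: the direction
        # in which the error would increase the most ...
        grad_w, grad_b = (p - y[i]) * X[i], (p - y[i])
        # ... so step in the opposite (negative) direction to descend
        w -= lr * grad_w
        b -= lr * grad_b

print("training accuracy:", ((forward(X) > 0.5) == y).mean())
```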

The aforementioned layers in a network define the architecture of the network. There are different types of layers and different ways to order these layers to create a network. This topic will not be expounded upon in detail, as it is not the purpose of this paper. The essential information is that layers, in general, transform the representation of the previous layers and that this can be done in many different ways, which leads to different results. One of the layer types that is extensively used in this research is the convolution layer. A convolution layer is commonly used to analyse visual imagery and is applied in image and video recognition software. Networks that make use of such layers are called Convolutional Neural Networks (CNNs, or ConvNets). Explaining CNNs was considered out of the scope of this research; information on CNNs may be found in Albawi, Mohammed, and Al-Zawi 2017.

2.2 Differential Privacy

There exists a trade-off between the utility of released statistical data and the privacy of the data points that contributed to the statistics (Brickell and Shmatikov 2008). Dwork 2006 proved that data cannot provide utility while privacy remains guaranteed if an attacker has access to auxiliary information. This impossibility has been demonstrated by the use of an example (Dwork 2006, p. 2). Suppose that one's height is considered a sensitive piece of information, and revealing the height of an individual is considered a privacy breach. Assume that there is a database that yields the average heights of women of different nationalities, and that an attacker has access to the auxiliary information that someone, let's say Terry Gross, is two inches shorter than the average Lithuanian woman. Then it becomes evident that the attacker learns Terry Gross's height, while anyone with only the auxiliary information learns relatively little. Therefore, Dwork 2006 finds two things: (1) this privacy breach occurs regardless of whether the individual, Terry Gross, has been included in the database, and (2) data cannot provide utility while privacy remains guaranteed if an attacker has access to auxiliary information. She then proceeds to define differential privacy (DP), which is a different approach to privacy that guarantees that the risk to one's privacy should not substantially increase as a result of participating in a statistical database (Dwork 2006). It enables the creation of useful statistics based on a set of data points, while the privacy of individual data points is ensured. More explicitly, it ensures that the statistics of the dataset are indistinguishable, to a certain degree, from the statistics that would be derived from a dataset in which any one of the data points were absent. Since this applies to any one of the data points, statistics that make use of DP guarantee privacy to all of the individual data points.

ε-DP guarantees DP to a certain degree that is expressed in terms of ε. The degree of DP that ε implies will become clear in section 2.2.2. It has been mathematically defined as follows (Dwork 2006, p. 9):

Theorem 1. A randomised function K provides ε-DP if for all datasets D1 and D2 differing on at most one element, and all S ⊆ Range(K):

Pr[K(D1) ∈ S] ≤ exp(ε) × Pr[K(D2) ∈ S]    (1)

If an attacker has access to auxiliary information, then statistics that are derived from data points inherently leak information about the absence or presence of data points, unless the statistics are purely random noise, providing no utility (Lee and Clifton 2011, pp. 334–335). Yet with differential privacy, the amount of leakage can be regulated and defined in terms of ε; ε defines to what degree the information may leak. This degree of allowed privacy leakage that ε quantifies is called the privacy budget. To regulate the privacy budget, DP essentially perturbs the statistics with random noise; it is an additive noise mechanism. In PPML, the data is composed of the training set and the statistics that are released are equal to the output that the ML model returns. Therefore, to achieve ε-DP, the model output should be perturbed with noise to conceal the absence/presence of the data points on which the statistics were composed.

Just as the meaning of privacy depends on the context, the appropriate ε-value that ensures a certain degree of privacy is also dependent on the context.

The amount of noise directly relates to the degree of privacy, regarding the presence/absence of participants in the database, that one is willing to sacrifice. However, it depends on a few other variables too, which will become evident from the example that follows in this section.

2.2.1 Sensitivity

Generally, sensitivity refers to the impact a change in data points can have on the result of a statistical query. Lundmark and Dahlman 2017 explain the concept of sensitivity in depth by providing an insightful example. The study takes two query functions:

q1(x) = x
q2(x) = x²

They let x be a dataset that can either increment or decrement by one. This is exactly what DP tries to obscure: whether a query output was computed on a dataset where one data point has been present or absent. They proceed to provide an equation that calculates the difference between the query outputs of datasets that differ in one data point, where xA and xB are subsets of database x with one differing value:

∆q(xA, xB) = |q(xA) − q(xB)|

Theorem 2. For a given query function q : D → R^d, let xA, xB be any data sets from all possible data sets of X differing in at most one element; sensitivity is then defined as:

Sensitivity = max_{xA, xB ⊆ X} ‖q(xA) − q(xB)‖₁    (2)

where ‖·‖₁ is the L1-norm distance between data sets differing in at most one element.

The effects that the queries q1 and q2, and the corresponding ∆q, have on the range [0, 6] are visualised in Table 1 (Lundmark and Dahlman 2017, p. 4).

Data set xi        x0   x1   x2   x3   x4   x5   x6
q1(xi)             0    1    2    3    4    5    6
∆q1(xi, xi+1)      1    1    1    1    1    1    ...
q2(xi)             0    1    4    9    16   25   36
∆q2(xi, xi+1)      1    3    5    7    9    11   ...

Table 1. Effects of queries q1 and q2, and the corresponding ∆q, on 0 ≤ x ≤ 6.

This table shows that the sensitivity of q1 is constant, while the sensitivity of q2 is unbounded (∆q2 → ∞ as x grows). The sensitivity of a query can be stated both in the global and the local context. Local sensitivity calculates the sensitivity for a local data set, where the possible changes are bound by the local data set and not the universe of all data sets (Lee and Clifton 2011, p. 5). In Table 1, the local sensitivity of q1(xi) is 1 and the local sensitivity of q2(xi) is equal to 11 (in the range [0, 6]). Calculating the local sensitivity is easy. However, Nissim, Raskhodnikova, and Smith 2007 concluded that it is problematic to use local sensitivity to satisfy ε-DP because the magnitude of the noise could leak information about the dataset. Global sensitivity, on the other hand, refers to the maximum difference of the query output when one change is made to any data set possible. This sensitivity is required to meet ε-DP requirements (Dwork 2006, p. 10). Since global sensitivity is defined based on the consideration of all possible data sets, it is only dependent on the query and not on the data. This has vast consequences on the utility of certain queries, as Lee and Clifton 2011 explain. A query that returns the sum of data points would have an infinite global sensitivity. This problem can be solved by introducing boundaries: the dataset can be modified to limit the ability to store values greater or smaller than a predetermined threshold. That way, the sensitivity cannot become infinite (note that the size of the data set is known).
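To make the distinction concrete, the short sketch below (illustrative Python, not code from this thesis) computes the sensitivity of q1 and q2 over the bounded range [0, 6] used in Table 1.

```python
# Illustrative sketch (not thesis code): sensitivity of q1(x) = x and
# q2(x) = x**2 when one data value changes by one within the bounded
# domain [0, 6] of Table 1.
def q1(x): return x
def q2(x): return x ** 2

domain = range(0, 7)

def sensitivity(q, domain):
    # Largest change in the query output caused by incrementing one value,
    # taken over the whole bounded domain.
    return max(abs(q(a) - q(a + 1)) for a in domain if a + 1 in domain)

print(sensitivity(q1, domain))  # 1  -> constant, independent of the domain
print(sensitivity(q2, domain))  # 11 -> grows with the domain, so bounding the
                                #       data is what keeps it finite
```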

2.2.2 Choosing ε

Dwork 2006 proved that, for a given query function f and dataset X, a randomised mechanism Mf that returns f(X) + Y as an answer, where Y is drawn i.i.d. from Lap(∆f/ε), provides ε-differential privacy. In addition to the foregoing, let Lap(λ) be the Laplace distribution with density function h(x) = (1/(2λ)) exp(−|x − µ|/λ), where µ is the mean and λ (> 0) is the scale factor.
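A minimal sketch of this Laplace mechanism is given below (illustrative Python/NumPy, not thesis code); the counting query and its sensitivity of 1 are assumptions chosen for the example.

```python
import numpy as np

# Illustrative sketch (not thesis code) of the Laplace mechanism
# M_f(X) = f(X) + Y with Y ~ Lap(Δf/ε), which provides ε-DP.
rng = np.random.default_rng()

def laplace_mechanism(true_answer, sensitivity, epsilon):
    scale = sensitivity / epsilon              # λ = Δf / ε
    return true_answer + rng.laplace(loc=0.0, scale=scale)

# Example: a counting query has global sensitivity 1, since adding or
# removing one person changes the count by at most 1.
true_count = 42
for eps in (0.1, 1.0, 10.0):
    noisy = laplace_mechanism(true_count, sensitivity=1.0, epsilon=eps)
    print(f"epsilon = {eps:>4}: noisy count = {noisy:.2f}")
```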

Lee and Clifton 2011 explored how a suitable ε may be decided upon. Again, an example can clearly convey when DP succeeds and when it does not. This example is based on a pre-existing example (Lee and Clifton 2011, pp. 329–338).

Participant   Age   Image data
John          10    15
Peter         25    80
Kim           30    100
Mary          90    200

Table 2. Example database X.

In the example database of Table 2, four participants are described together with their age and an image that belongs to them. For clarity and demonstration purposes, the image data is limited; the image consists of one hypothetical greyscale pixel.

Suppose that this database, X, comes from a small health study and that it belongs to a hospital. In the database, there is one participant, Mary, who has been diagnosed with a disease. Assume now that the hospital wants to release statistics about healthy patients. Let X′ denote the database to be published; X′ = X − {Mary}. The hospital has to obey privacy laws and cannot reveal whether an individual has a disease or not. Therefore, it decides to allow the data to be queried only through an ε-DP mechanism. Let's suppose that the hospital needs to guarantee that the probability of identifying a participant's presence or absence in the database is less than 1/3. In this example, it becomes clear that, depending on the query, a different value of ε is required to satisfy the hospital's requirement.

One of the statistics that the hospital wants to release is the mean age and image data of the healthy participants in the health study (X′). The mean of the age is query function q_age and the mean of the image data is q_image. To adhere to the privacy regulations, the sensitivity of the query functions, ∆f, is computed first. The global sensitivity of q_age on X′ can be computed. As said before, calculating the global sensitivity requires global knowledge of the domain, since every possible attribute value that not only presently exists in X′, but also could have existed in X′, needs to be considered. In this case, there are only four participants in the original database X, so there are four attribute values that need to be considered for X′. If it would, however, be the case that this database could grow, then the hospital should work with hypothetical data to guarantee ε-DP. The problem, namely that the means of the age and the image data would otherwise have unbounded sensitivity, could reasonably be solved by assuming that the age has an upper bound of 125 and that the image data, which is a greyscale pixel, has an upper bound of 255. In this example, though, we assume that the dataset X is known to be composed of these four participants. We even assume that the attacker has access to the records in X. With ε-DP, we have to assume that the adversary knows everything about the universe, except which individual is missing in database X′, or better said, who carries the disease.

Therefore, the global sensitivity of q_age and q_image on X′ can be calculated by using the following formula, in which Y is any possible data instance of size 3 and t is a tuple of Y (in relational databases t is often called a row):

∆q(X′) = max_{Y ⊂ X, |Y| = 3} |q(Y) − q(Y − t)|

In our case, this is equal to:

∆q_age(X′) = (10 + 25 + 90)/3 − (10 + 25)/2 = 145/6

Let's suppose that the hospital decides to choose ε = 5. The random noise generated by the Laplace distribution with µ = 0 has a scale factor λ; in this case that would be ∆q_age/ε = 29/6. The query response γ is the result of the original query perturbed with Laplacian noise by the ε-DP mechanism K:

γ = Kq_age = q_age(X′) + Lap(∆q_age/ε) = (10 + 25 + 30)/3 + Lap(29/6) = 65/3 + 5.3333 ≈ 27

This is in effect a Laplace distribution shifted so that its mean lies on q_age(X′) and its scale factor is ∆q_age/ε. By making use of the cumulative density function of the Laplace distribution, we can show:

Pr[Kq_age(X) ≥ 27] / Pr[Kq_age(X′) ≥ 27] ≤ exp(ε)

The attacker cannot distinguish whether the response γ came from a query against X or against X′ beyond this factor of exp(ε), and so DP is guaranteed. However, as mentioned before, what this ε means is entirely dependent on the context. What really is important, and what underlies the concept of ε-DP, is the question: to what degree is the privacy of the individuals in the database protected?

To determine membership in X′, the adversary devises a set of tuples ⟨ω, α, β⟩ for each possible combination ω of X′. We refer to each possible combination ω in Ψ as a possible world. Note that there are 4 possible worlds if worlds of size 3 are created from a dataset with 4 data points (the rule C(n, n−1) = n can be applied, which yields 4). The α and β that relate to each world are the prior and posterior belief, respectively, on X′ = ω. In the example, we assume that the prior belief about each world is uniform; in other words, the attacker has no bias towards any world in particular. This is not always the case. If it were the case that the disease that Mary suffers from is more likely to occur in elderly people, then the attacker could use this information to his advantage. Under the assumption that the belief is uniform, we get ∀ω ∈ Ψ, α(ω) = 1/n.

Theorem 3. Given the query function q and the query response γ = Kq(X′), for each possible world ω, the posterior belief on ω of the attacker is defined as:

β(ω) = P(X′ = ω | γ) = P(Kq(ω) = γ) / P(γ) = P(Kq(ω) = γ) / Σ_{ψ∈Ψ} P(Kq(ψ) = γ)    (3)

where Kq is an ε-DP mechanism for the query function q.


The attacker queries the different worlds and computes this posterior belief β(ω) for each world. The best guess of the attacker is to consider the world with the highest posterior belief as the real world. The confidence that the attacker has is then defined by the difference between the posterior belief and the prior belief: conf(ω*) = β(ω*) − α(ω*), where ω* is that most plausible world.

In Table 3 the posterior belief β(ω) of the attacker is visualised for the query q_age. It is important to understand that this table was constructed from the outputs of an ε-DP mechanism that randomly draws samples from the Laplace distribution. In other words, it is one arbitrary realisation, used for explanation purposes.

Possible world    ε=0.00005   ε=0.01    ε=0.1     ε=0.5     ε=1       ε=2       ε=5       ε=50
{10, 25, 30}*     0.46302     0.36849   0.17848   0.19787   0.18548   0.66463   0.9737    1.0
{10, 25, 90}      0.12817     0.00156   0.40654   0.11842   0.21678   0.01978   0.00467   0.0
{10, 30, 90}      0.36396     0.37161   0.23953   0.31976   0.57925   0.27775   0.0123    0.0
{25, 30, 90}      0.04485     0.25834   0.17544   0.36395   0.01849   0.03784   0.00932   0.0

Table 3. Posterior belief on Kq_age with different values for ε (the asterisk marks the real world X′).

Statistically, the real world, w1 = {10, 25, 30}, should become more plausible as ε increases. An ε of 50 limits the output perturbation to the point where it is very clear that w1 is correct. Therefore, privacy is breached because the attacker figures out that Mary suffers from the disease. The ε values 5 and 2 also seem to leak the real world. This means that, if the attacker made this observation, the hospital would have failed in guaranteeing individual privacy. However, in the cases where ε is one or lower, the attacker would not be able to breach the privacy, except in the case where ε is equal to 5e−5; there, the random output perturbation happened to make the real world the most plausible, by chance. This is a good thing, though, because if the ε-DP mechanism would make sure that this could never occur, the attacker could figure out that w1 never becomes the most plausible world even though it should occasionally happen by chance. This could then breach privacy again.
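The sketch below (illustrative Python, not thesis code) shows how a table such as Table 3 can be generated: one noisy ε-DP release is simulated and the posterior belief of every possible world is computed under a uniform prior.

```python
import math
import random
from itertools import combinations

# Illustrative sketch (not thesis code) of how a table such as Table 3 can be
# produced: the hospital releases one ε-DP noisy mean age of the real world
# X' = {10, 25, 30}, and the attacker computes the posterior belief of every
# possible world under a uniform prior.
ages = [10, 25, 30, 90]                    # John, Peter, Kim, Mary
worlds = list(combinations(ages, 3))       # the 4 possible worlds of size 3
real_world = (10, 25, 30)
delta_q = 145 / 6                          # global sensitivity of q_age

def mean(w):
    return sum(w) / len(w)

def laplace_pdf(x, mu, scale):
    return math.exp(-abs(x - mu) / scale) / (2 * scale)

def posterior_beliefs(epsilon):
    scale = delta_q / epsilon
    # one noisy release by the ε-DP mechanism (Laplace noise via exponential + random sign)
    gamma = mean(real_world) + random.choice([-1, 1]) * random.expovariate(1 / scale)
    likelihoods = {w: laplace_pdf(gamma, mean(w), scale) for w in worlds}
    total = sum(likelihoods.values())
    return {w: lik / total for w, lik in likelihoods.items()}  # uniform prior cancels out

for eps in (0.1, 1, 5, 50):
    beliefs = posterior_beliefs(eps)
    best = max(beliefs, key=beliefs.get)
    print(f"epsilon = {eps:>4}: most plausible world {best}, belief {beliefs[best]:.3f}")
```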

As mentioned before, the appropriate value of ε that satisfies a certain degree of privacy is dependent on the context. If the query were to release the mean of the image data, q_image, then it would be a different scenario. The sensitivity would then be equal to 305/6 and the posteriors β(ω) would change for this query too; see Table 4.

Possible world    ε=0.00005   ε=0.01    ε=0.1     ε=0.5     ε=1       ε=2       ε=5       ε=50
{15, 80, 100}*    0.10922     0.2734    0.23134   0.19973   0.37105   0.22521   0.93727   1.0
{15, 80, 200}     0.38153     0.55227   0.23863   0.45984   0.16813   0.29269   0.0592    0.0
{15, 100, 200}    0.26828     0.06757   0.22249   0.33046   0.23934   0.40988   0.0026    0.0
{80, 100, 200}    0.24098     0.10676   0.30755   0.00998   0.22148   0.07222   0.00093   0.0

Table 4. Posterior belief on Kq_image with different values for ε (the asterisk marks the real world X′).

The question remains: what is the optimal ε-value to choose? A low ε limits the abilities of the attacker to identify an individual, but it also limits the utility of the data. We can calculate the value of ε for which the probability of picking the right world based on the posteriors is equal to our wish: 1/3. We can rewrite the posterior belief formula β(ω) as follows:

β(ω) = 1 / (1 + (n − 1) e^(−ε ∆v/∆q))

where ∆v = max_{1≤i,j≤n, i≠j} |q(wi) − q(wj)| and n = 4, because there are four worlds to be considered (Lee and Clifton 2011, p. 334). ∆v can be understood as the maximum difference that is possible in the query results. In our example with q_age, ∆v would be equal to 145/3 − 65/3 = 80/3. To make every world equally plausible, β(ω) needs to be equal to 1/n. The only value of ε that satisfies this is ε = 0.


Therefore, an agreement between privacy and utility is inherently required. In this case, we agree on ρ = 1/3, which is the probability that the attacker determines the right world and finds out who carries the disease (and who does not). We can bound the rewritten formula of the posterior belief by this probability ρ and then find the value of ε that satisfies this. This can be done as follows:

1 / (1 + (n − 1) e^(−ε ∆v/∆q)) ≤ ρ

This can be rearranged to:

ε ≤ (∆q/∆v) ln((n − 1)ρ / (1 − ρ))

Solving this for q_age and a ρ of 1/3 is then only a matter of filling in the variables. Remember that ∆q_age is 145/6, ∆v is 80/3 and n = 4:

ε ≤ (29/32) ln(3/2) ≈ 0.367

This proves that an ε of 0.367 would strictly guarantee that ρ ≤ 1/3. For q_image, slightly less noise needs to be added to achieve the same privacy guarantee:

ε ≤ (61/74) ln(3/2) ≈ 0.334
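As a quick numerical check of these bounds (illustrative Python; the ∆v value for q_image is derived from the example worlds and is not stated explicitly in the text):

```python
import math

# Quick check of the bound ε ≤ (Δq/Δv) · ln((n-1)ρ / (1-ρ)) with the values
# from the running example (illustrative only).
def epsilon_bound(delta_q, delta_v, n, rho):
    return (delta_q / delta_v) * math.log((n - 1) * rho / (1 - rho))

print(epsilon_bound(145 / 6, 80 / 3, n=4, rho=1 / 3))    # ≈ 0.367 for q_age
print(epsilon_bound(305 / 6, 185 / 3, n=4, rho=1 / 3))   # ≈ 0.334 for q_image (Δv = 380/3 − 65 = 185/3)
```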

2.2.3 Relaxing ε-DP

Regardless, it must be noted that if the domain does not include outliers (the attribute sensitivity is low), the upper bound of ε that satisfies ρ = 1/3 might be excessive. In such a case, a binary search could be utilised to find an appropriate and less strict value for ε (Lee and Clifton 2011, p. 336).

We can also relax ε-DP in another way. (ε, δ)-DP relaxes the notion of ε-DP by allowing the ε-DP guarantee to be broken with a probability that is at most δ. In this way, we can work with a less strict notion of privacy, but the degree of strictness that we lose is quantified in the parameter δ. We can still achieve pure differential privacy with (ε, δ)-DP if we set δ = 0 (Steinke and Ullman 2015, p. 1). This notion of DP is defined by:

Theorem 4. A randomised function K provides (ε, δ)-DP if for all datasets D1 and D2 differing on at most one element, and all S ⊆ Range(K):

Pr[K(D1) ∈ S] ≤ exp(ε) × Pr[K(D2) ∈ S] + δ    (4)

The choice of the value of δ is arbitrary, but usually it is set to the inverse size of the dataset X that is being queried, 1/|X| (Dwork, Roth, et al. 2014, p. 18).

2.2.4 Applying DP on DNNs

Differential privacy can be applied to ML models too. This allows one to distribute a model while promising that the privacy of the individual data points on which the model has been trained can be guaranteed to a certain degree. However, it generally requires a more sophisticated approach than our previous example, because of the more complex nature of DNNs. The sensitivity of the output of a DNN is also less clear-cut. Furthermore, the dimensionality of the input data with which a DNN can be queried is generally significantly larger than in our previous example. For example, a 200 by 200 pixel colour image encoded in RGB, which has three channels, is composed of 200 × 200 × 3 = 120000 individual values.

Abadi et al. 2016 explored deep learning with differential privacy. They describe that one could, in principle, add noise directly to the model parameters. However, they find that this is not feasible because adding overly conservative noise often destroys the utility of the model. Others have suggested adding (ε, δ)-DP noise to the logits of a model (Orekondy, Shadi, and Fritz 2019, p. 2). Although this would ensure that training is not hindered, it would not provide a viable practical solution because the model can be queried a large number of times and averaging the outputs would break the privacy. By following the law of large numbers, we can show:

(1/m) Σ_{i=1}^{m} Kq(x) → q(x) as m → ∞

Abadi et al. 2016 write that a possible solution is to add DP to the SGD optimiser; this is called DP-SGD. During training on a dataset containing N data points, this can be achieved as follows. At each step of the SGD, the gradient is calculated as usual, and then the l2 norm of this gradient is clipped. Abadi et al. 2016 found it appropriate to choose the gradient clipping norm by taking the median of the norms of the unclipped gradients throughout the training procedure. After L gradients have been calculated, they compute the average of these gradients and add noise to protect privacy. Finally, to train the model, the weights are updated in the opposite direction of this average noisy gradient, similar to what has been described in section 2.1. By using the privacy amplification theorem, they prove that by doing so, each step guarantees (qε, qδ)-DP with respect to the full database, in which q is equal to L/N (Abadi et al. 2016, p. 310). However, they mention that until then the best overall bound could be computed by making use of the strong composition theorem, whereas they invented a stronger method called the moments accountant. With this, they achieve significantly tighter bounds for ε and δ, which is good because it allows for more utility of the data while guaranteeing the same (ε, δ)-DP. The moments accountant method of achieving (ε, δ)-DP in DNNs has been used in this paper.
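The sketch below (plain NumPy, illustrative only; it is not the moments accountant implementation used in this thesis and performs no privacy accounting) shows the mechanics of a single DP-SGD step on a toy linear model: per-example gradients are clipped to an l2 norm C, summed, perturbed with Gaussian noise, averaged over the lot, and used for the update.

```python
import numpy as np

# Illustrative DP-SGD step on a toy linear model with squared loss (not the
# moments accountant code used in this thesis; no privacy accounting is done).
rng = np.random.default_rng(0)

def per_example_gradients(w, X, y):
    # gradient of 0.5 * (x.w - y)^2 for every example, shape (L, d)
    return (X @ w - y)[:, None] * X

def dp_sgd_step(w, X_lot, y_lot, lr=0.05, clip_norm=1.0, noise_multiplier=1.1):
    grads = per_example_gradients(w, X_lot, y_lot)
    # 1) clip every per-example gradient to l2 norm <= C
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads / np.maximum(1.0, norms / clip_norm)
    # 2) sum, add Gaussian noise scaled to sigma * C, and average over the lot size L
    noisy_sum = grads.sum(axis=0) + rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    noisy_avg = noisy_sum / len(X_lot)
    # 3) step in the opposite direction of the noisy average gradient
    return w - lr * noisy_avg

# toy regression data: y ≈ x . w_true
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(1000, 3))
y = X @ w_true + 0.01 * rng.normal(size=1000)

w, lot_size = np.zeros(3), 100                               # q = L/N = 100/1000
for _ in range(300):
    idx = rng.choice(len(X), size=lot_size, replace=False)   # random lot sampling
    w = dp_sgd_step(w, X[idx], y[idx])
print("weights after noisy training:", np.round(w, 2))       # should roughly recover w_true
```

In practice, existing DP libraries combine this per-example clipping and noising with privacy accounting, rather than hand-rolled code like the above.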

Nonetheless, there are complexities involved in adding DP to DL models. The cost of privacy is diminished performance (the privacy-utility trade-off) and this is one factor that hinders the deployment of DP. Preliminary research indicates that this cost in accuracy is not borne equally. Bagdasaryan, Poursaeed, and Shmatikov 2019 report that the accuracy on underrepresented classes drops disproportionately. They conducted experiments in which they also used the moments accountant technique to add DP to DL models. They found that gradient clipping, as well as adding noise, takes its toll particularly on the accuracy of underrepresented classes. They conclude that if a non-DP model is unfair (in the sense that the accuracy is not equal for all classes), then DP-SGD exacerbates this unfairness.

2.3 Model Extraction Attacks

In this section, particular attention is devoted to two papers: Knockoff Nets and ActiveThief, written by Orekondy, Schiele, and Fritz 2019 and Pal et al. 2020, respectively. They provide a practical approach to model extraction attacks, which was used in this research. Thereupon, the expected capabilities of model extraction attacks are reviewed. Finally, a short review of other relevant literature in this field is provided. This section contains literature that was not used in this paper, either because it was outside of the scope of this research or because of its time frame limits.

Principally, model extraction attacks exploit query access to steal or replicate the functionality of models (Tramèr et al. 2016, p. 601). In this regard, a victim who published a black-box model is considered, as well as an attacker with query access and an intention to steal the victim's model. Typically, model extraction attacks are carried out on DNNs as follows: the attacker acquires a sample set and queries the victim model with these samples, whereupon the victim returns confidence values or labels for these queried samples, and the attacker uses this information to train a functionally equivalent model.
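The following sketch (illustrative scikit-learn code, not the thesis implementation) walks through this recipe with a label-only victim; the dataset, architectures and split sizes are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Illustrative model extraction loop (not the thesis implementation): the
# attacker never sees the victim's training labels, only query responses.
X, y = make_classification(n_samples=4000, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_victim, y_victim = X[:2000], y[:2000]          # victim's private training data
X_attacker = X[2000:3500]                        # attacker's own unlabelled query set
X_test, y_test = X[3500:], y[3500:]

victim = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                       random_state=0).fit(X_victim, y_victim)

# 1) query the black-box victim with the attacker's query set (label-only access)
transfer_labels = victim.predict(X_attacker)
# 2) train a substitute model on the resulting transfer set
substitute = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                           random_state=1).fit(X_attacker, transfer_labels)
# 3) evaluate accuracy and agreement (fraction of identical predictions)
print("victim accuracy:    ", victim.score(X_test, y_test))
print("substitute accuracy:", substitute.score(X_test, y_test))
print("agreement:          ", (substitute.predict(X_test) == victim.predict(X_test)).mean())
```

In the thesis experiments the same loop is applied to DNNs and image data; this tabular example only illustrates the mechanics.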


The attacker may possess various degrees of prior knowledge about this victim model, such as knowledge about the architecture of the victim model, or the training data that was used during training. Besides possessing knowledge, the capabilities of the attacker play an important role too. For example, an attacker might know that a model was trained on particular data, such as x-ray images of hands, but the attacker might not be able to acquire comparable data. That is to say that knowledge alone does not necessarily give the attacker an advantage, though it may surely help. Even if the attacker cannot obtain comparable training data, the attacker might still be able to discern a suitable alternative training set, such as x-ray images of feet. A similar scenario has been examined during the experiments that were conducted in section 4.1.2.

2.3.1 Knockoff Nets

Orekondy, Schiele, and Fritz 2019 proposed an intuitive way of extracting different DNNs. It essentially considers a black-box victim model that an attacker wants to extract, or better said, turn into a local white-box model. Orekondy, Schiele, and Fritz 2019 observed that a black-box victim model essentially acts as an oracle, which describes arbitrary inputs. Any possible input can be queried, and the black-box victim model will describe this input in the sense of output values, be it confidence values or labels. Learning from such a model mimics the task of supervised learning, which is mapping inputs to outputs based on feature-label pairs. It is, in effect, training a new model to learn under the supervision of the victim model. Figure 1 visualises how Orekondy, Schiele, and Fritz 2019 create a Knockoff Net.

Figure 1. Flowchart describing the process of extracting a black-box model.

On the left side, it is visualised that the victim possesses a victim database, containing features and labels. The victim trains the victim model with these features and labels. After the training phase is done, the victim provides the attacker with query access to the black-box victim model (this can occur when the victim publicises the model). The attacker, who is visualised on the right side of the image, does not possess the victim database. However, he does possess different data with which to query the victim model. Note that this data does not have to contain labels, because the labels are to be acquired from the oracle of the black-box victim model. Upon querying the victim model, he receives a victim prediction. These feature-prediction pairs can then be used to train the attacker model. The predictions that the model returns can be of different quality; in this study, output in the form of confidence values, as well as labels, has been regarded. Orekondy, Schiele, and Fritz 2019 finally add that similarities can be drawn between model extraction attacks and knowledge distillation (KD) (Hinton, Vinyals, and Dean 2015). In both cases, the objective is to transfer knowledge from a larger teacher network to a compact student network.

sampling strategy The way in which the attacker queries the victim model is of importance too. The attacker can query the victim model with random features from the attacker database, which is called random sampling, or an adaptive sampling method can be applied. Orekondy, Schiele, and Fritz 2019 take a reinforcement learning approach to adaptive sampling. They evaluate the quality of sampled images based on three rewards: the certainty of the victim model, the diversity of the samples, and the difference between the victim model's predictions and the attacker model's predictions. Based on that, they pick new samples from the attacker database, in the hope that the attacker can learn more rapidly from the victim than when a random approach is adopted.

The number of queries that the attacker is allowed to make to the victim model may be limited. This is called a query budget. Such a budget can be necessary for a variety of reasons: the victim model may not allow more than n queries, more than n queries may arouse suspicion (and the attacker wants to remain stealthy), or the costs of querying the victim model may be high. The latter reason could be relevant if the goal is to steal a paid MLaaS model for financial reasons.

contributions They make three main observations: (1) querying random images from a different distribution than that of the black-box training data can still result in a well-performing Knockoff Net; (2) this is possible even when the Knockoff Net is represented using a different architecture; (3) an adaptive sampling strategy improves query sample efficiency and provides performance gains.

reconstructing results Along with their paper, Orekondy, Schiele, and Fritz 2019 provide (largely) working code on GitHub for the majority of their experiments. This helped in reconstructing most of their results, though their adaptive learning strategy had not been included in their code. Implementing the adaptive learning strategy is also not straightforward, because the code purposefully splits the collection of the victim input-output pairs from the training of the attacker model. Therefore, the structure of the code needs to be changed to devise an adaptive sampling strategy that lets the attacker pick samples from the attacker database based on the output of the victim model.

2.3.2 ActiveThief

Pal et al. 2020 approach model extraction attacks in the same way as described above; Figure 1 is applicable again. Therefore, rather than repeating it, this section is dedicated to describing the differences in approach and findings.

query set types They describe that the attacker can have multiple levels of knowledge of, and access to, the victim database. The attacker is expected to be most successful if the attacker has access to the victim problem domain; that is, data of the type that belongs to the original problem domain. Usually, the attacker does not possess this data; at least, not if the victim succeeds in protecting the privacy of the individuals in the dataset. In the case that the victim created a well-performing model based on publicly accessible data, this scenario could arise if the attacker aims to extract the model to benefit from its performance. If this type of data is inaccessible, the attacker could still have access to the limited problem domain, which is data that closely resembles the victim database. The attacker can also have access to natural non-problem domain data, which is sampled from publicly available datasets. Otherwise, the attacker may choose to sample from the synthetic non-problem domain, which is sampled from standard probability distributions that do not necessarily model the problem domain distributions. Pal et al. 2020 note that data coming from the natural non-problem domain has been shown to be more effective than data coming from the synthetic non-problem domain.

agreement They introduce the evaluation metric agreement, which measures the closeness between the predictions coming from the victim and attacker models, evaluated on the test set.

sampling strategy Furthermore, they experimented with four different adaptive learning sampling strategies. The first one was based on the uncertainty of the attacker model. The second one maximises sample diversity by adopting the greedy K-Center algorithm of Sener and Savarese 2017. The third one uses DeepFool Adversarial Learning (DFAL) from Ducoffe and Precioso 2018 to select highly informative samples. The final strategy is a combination of the last two and is called the DFAL K-Center approach. They found that the combined strategy, DFAL K-Center, yields the best overall performance on image classifiers (Pal et al. 2020, p. 5). Therefore, the decision was made to implement this sampling strategy in this research.

DFAL is an active learning strategy that has been developed to minimise the number of data annotations queried from an oracle during training (Ducoffe and Precioso 2018, p. 1). It focuses on finding samples that lie close to the decision boundary of the model. The margin theory for active learning states that samples close to the decision boundary can considerably decrease the number of annotations required, yet it is intractable to measure the exact distance from samples to the decision boundaries (Ducoffe and Precioso 2018, p. 1). The information that adversarial examples provide about the decision boundaries of a model can be exploited to our advantage. Before proceeding, it is necessary to understand what adversarial examples are.

Intuitively, adversarial examples can be thought of as optical illusions for ML systems. These samples trick an ML model into classifying the sample as one thing, while it is, in fact, another thing. This perturbation can be achieved by adding noise. Figure 2 shows how an image of a handwritten one can be perturbed by adding noise. This noise can cause the ML model to classify the digit as a seven, instead of a one.

Figure 2. An image of the handwritten digit one can be perturbed by adding noise to it so that a model classifies it as a seven instead.

Two common strategies to create adversarial examples are the Fast Gradient Sign Method (FGSM) and DFAL. In essence, they are quite similar because they both have the same purpose: creating adversarial examples. FGSM lets a model predict on a given input (i.e. the left image of a one in Figure 2), after which it calculates the loss of these predictions. It then uses this loss to calculate gradients, but instead of updating the network to be more accurate, it updates the input image itself. It calculates the gradients with respect to the pixel values of the input image and then takes the sign of these gradients. This means that all negative values become −1 and all positive values become 1. It then multiplies these signs by a scalar, ε, so that the matrix contains positive and negative ε values³. This is noise, but it is not random noise (see the image in the middle of Figure 2). Adding this noise to the image data produces the image on the right in Figure 2, with the result that the model is now more likely to classify this image as a seven. This happens because instead of changing the image in the negative direction of the gradient, which would make the classifier more confident⁴, the image is changed in the positive direction of the gradient, which makes the classifier less confident, possibly even to the point that the image gets classified as another class.
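The perturbation step described above can be written down compactly. The following is a minimal FGSM sketch in PyTorch, assuming a classifier whose inputs are scaled to [0, 1]; fgsm_example is a hypothetical helper name and not part of any code used in this thesis.

import torch

def fgsm_example(model, loss_fn, image, label, eps):
    # Craft an adversarial example with the Fast Gradient Sign Method.
    image = image.clone().detach().requires_grad_(True)
    loss = loss_fn(model(image), label)
    model.zero_grad()
    loss.backward()
    # Step in the direction that increases the loss: the sign of the input gradient.
    perturbed = image + eps * image.grad.sign()
    # Keep pixel values in a valid range (assuming inputs scaled to [0, 1]).
    return torch.clamp(perturbed, 0.0, 1.0).detach()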

As mentioned before, DFAL has the same goal, but it pursues it more efficiently and effectively. This strategy chooses to query the victim model with the original samples that lie closest to the estimated attacker decision boundaries, which have been identified by crafting adversarial samples. The original samples are queried, rather than the perturbed ones, in order to remain stealthy as an attacker. That is because perturbed adversarial samples can be detected by methods that are attentive to the distribution that the queried samples follow. For instance, Juuti et al. 2019 found that the distances between natural samples should fit a (close to) normal (Gaussian) distribution, whereas the distances between adversarial samples deviate strongly from the normal distribution and can therefore be detected in practice (Juuti et al. 2019, pp. 521–522). Thus, by using original and natural samples, ActiveThief ensures that this attack method evades detection by the method proposed by Juuti et al. 2019 (Pal et al. 2020, p. 6).
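To make the detection idea concrete, a simplified sketch is given below: it computes, for each incoming query, the distance to its closest earlier query and applies a Shapiro–Wilk normality test to those minimum distances. This is only an illustration of the principle and not the actual PRADA algorithm; the function name and threshold are assumptions.

import numpy as np
from scipy import stats

def looks_like_natural_queries(queries, threshold=0.05):
    # queries: a list of query inputs received so far (all with the same shape).
    flat = np.asarray(queries, dtype=np.float64).reshape(len(queries), -1)
    min_dists = []
    for i in range(1, len(flat)):
        # Distance of query i to its nearest earlier query.
        dists = np.linalg.norm(flat[:i] - flat[i], axis=1)
        min_dists.append(dists.min())
    _, p_value = stats.shapiro(min_dists)
    # A low p-value means the distances deviate from normality, which is suspicious.
    return p_value > threshold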

The K-Centre algorithm is used to select diverse samples. It works by querying the attacker's own model with samples from the query set to obtain output values M_attacker(X). These output values M_attacker(X) are then used as the initial cluster centres. Pal et al. 2020 describe that in each subsequent iteration, the strategy selects the k samples X_n whose outputs M_attacker(X_n) are most distant from all existing centres. These outputs M_attacker(X_n) are subsequently added to the already existing centres, so that the next set of samples is the most distant from all samples that were queried before.
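A greedy selection in this spirit could look as follows. This is a sketch that assumes Euclidean distance between output vectors; k_center_select is a hypothetical helper and not the ActiveThief implementation.

import torch

def k_center_select(candidate_outputs, centre_outputs, k):
    # candidate_outputs: M_attacker(X) for the unlabelled pool, shape (N, C)
    # centre_outputs:    outputs of samples selected so far, shape (M, C)
    # Returns indices of the k candidates whose outputs are farthest from all centres.
    centres = centre_outputs.clone()
    selected = []
    for _ in range(k):
        # Distance from every candidate to its nearest centre.
        nearest = torch.cdist(candidate_outputs, centres).min(dim=1).values
        idx = int(torch.argmax(nearest))  # candidate farthest from every centre
        selected.append(idx)
        centres = torch.cat([centres, candidate_outputs[idx:idx + 1]], dim=0)
    return selected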

Pal et al. 2020 found that the K-Centre strategy maximises diversity, but does not ensure that each sample is informative. The DFAL strategy, on the other hand, ensures that each sample is informative because it lies close to the decision boundaries, but it does not eliminate redundancy. Therefore, the DFAL strategy is first used to pick an initial subset of m informative samples, and from these m samples, the K-Centre strategy picks n diverse samples.

2.3.3 Capabilities

Early work by Tramèr et al. 2016 shows how models including logistic regression, neural networks and decision trees can be extracted. They found that their attacks could extract the exact parameters of certain linear classifiers and decision trees belonging to the victim. It would be less likely that the exact weights of neural networks could be extracted in this manner. The reason is that neural networks typically have a lot of weights, as described in section 2.1 on page 13, and these weights are often initialised stochastically, for example with the "Xavier" normalised initialisation method as described by Glorot and Bengio 2010. Furthermore, during training, the weights can be updated stochastically as well, for example by SGD. This has been described in section 2.1 on page 13, and the result is that functionally equivalent neural networks can adopt completely different weights. Tramèr et al. 2016 did note that their attacks yield close functional equivalents of the victim models.

3 Please mind that this has nothing to do with the ε of DP.

4 Modifying the input image in this manner to make a classifier more confident is essentially what the model inversion attack, as described by Fredrikson, Jha, and Ristenpart 2015, attempts.


They experimented with limiting the victim model to output only the predicted classes instead of the confidence values, and they found that this reduces the capabilities of the attacker, though the model extraction attack may still be harmful to the victim (Tramèr et al. 2016, p. 615). Furthermore, they also mention that model extraction attacks may facilitate data privacy-abusing attacks, such as model inversion (Tramèr et al. 2016, pp. 607–608). Additionally, Tramèr et al. 2016 discuss DP as well. They describe how DP is often used to protect individual training data; if this is accomplished, it should decrease the attacker's capability to learn information about training set elements. However, they note that this usage of DP is not designed to prevent model extraction attacks. They explain that model extraction attacks could theoretically be hindered if DP were applied directly to the model parameters; in other words, this would prevent the attacker from distinguishing between neighbouring model parameters.

2.3.4 Other

Oh, Schiele, and Fritz 2019 present a method to infer attributes of black-box models. The attributes that can be inferred are the architecture of the model, the training hyperparameters and the data that the model was trained on. It is expected that an attacker can take advantage of this information to perform model extraction attacks.

Shi, Sagduyu, and Grushin 2017 showcase model extraction attacks on black-box machine learning models, including artificial neural networks, Naive Bayes and Support Vector Machines (SVM). They have researched model extraction attacks by training attacker models that use different architectures than the victim model. In general, they conclude that deep learning architectures are capable of inferring the parameters of Naive Bayes and SVM classifiers, while the opposite does not necessarily hold.

Others have done similar experiments with model extraction attacks on SVMs and extended their work to Support Vector Regression (SVR) (Reith, Schneider, and Tkachenko 2019). Reith, Schneider, and Tkachenko 2019 also show how their attacks can be executed in practice by performing tests on different models.

Finally, pioneering research on protecting against model extraction attacks has been done by Juuti et al. 2019. They are able to detect abnormal behaviour that typically occurs when model extraction attacks are performed; this is a first step towards protecting against model extraction. They reported that PRADA is capable of detecting all prior model extraction attacks, though section 2.3.2 on page 23 describes how this detection has been bypassed. This highlights the current limitations of PRADA's detection software.


3 approach and methodology

3.1 Approach

As has been mentioned before, the experiments in this research have been divided into two parts. The first part starts in section 4.1 on page 32 and explores how DP can be added to DNNs that are trained on the relatively low-dimensional MNIST dataset. The MNIST dataset contains images of handwritten digits and is commonly used to train digit classifiers. The setting is considered in which a victim trains multiple DNNs on the MNIST dataset that guarantee different levels of DP, whereupon the attacker performs model extraction attacks on these DP DNNs. This interaction can provide insight into training a DP DNN, as well as into the capabilities that an attacker possesses to perform model extraction.

Part two, which is described in section 4.2 on page 47, focuses on the feasibility and practical side of adding DP to relatively large DNNs, namely Residual Neural Networks, that are trained on complex and high-dimensional data. The complex and high-dimensional data chosen in this research are medical chest x-ray images that are linked to different conditions. These images originate from the COVIDx dataset and the three conditions, or labels, are normal, pneumonia and COVID-19. Essentially, the purpose is to train a classifier that can distinguish between these medical conditions based on image data. It is reasonable to assume that there is an incentive to publish such models that are trained on privacy-sensitive data (Rajpurkar et al. 2017, p. 2). Yet because these models are prone to attacks that infringe the privacy of individuals included in the training set, there is a motive to consider mechanisms that ensure individual privacy, such as DP (Fredrikson, Jha, and Ristenpart 2015) (Shokri et al. 2017). The details of the COVIDx dataset are described extensively in section 3.2.3 on page 30. Section 3.2 defines the general methods that have been used, such as the tools used to train the various DNNs and the extension used to add DP to the DNNs. Thereafter, one section is dedicated to the MNIST database and the architecture that was used, while another section is dedicated to the dataset of chest x-ray images.

3.2 Methodology

general All experiments were conducted using Python version 3.6.9. The tests were run on Google Colab PRO, as it provides access to an NVIDIA Tesla P100 GPU together with Cuda compilation tools V10.1.243. The advantage of this is that the code can be run on massively parallel SIMD architectures. Massively parallel hardware can run a larger number of operations per second than the CPU and generally speeds up performance significantly (NVIDIA 2018, p. 1).

PyTorch and TensorFlow are two open-source libraries that can be used to train machine learning models. The advantage of choosing these libraries is that both are compatible with Cuda. In this research, the decision was made to use PyTorch.

applying dp to dnns Models were trained with PyTorch from PyTorch 2020 and subsequently DP was added during training with PyTorch-DP from Facebook 2020. To put it briefly, PyTorch-DP makes use of DP-SGD, which means that noise is added to the optimiser itself as described in section 2.2.4 on page 19. PyTorch-DP achieves this by adding a Privacy Engine to the optimiser, which is in charge of perturbing the computed gradients at each step with noise. PyTorch-DP implements DP based on the Rényi divergence, which is called Rényi DP (RDP) (Facebook 2020). As stated before, the focus of this paper lies on (ε, δ)-DP, therefore RDP will not be discussed in detail; for additional information, please refer to Mironov 2017.

A conversion from RDP to (ε, δ)-DP is also described in Mironov 2017; this conversion has been applied so that everything could be expressed in terms of (ε, δ)-DP.

Essentially, there are three modifiable parameters in the PyTorch-DP framework that are most relevant to understand within the scope of this research. These parameters ultimately define the ε and δ in (ε, δ)-DP. The first parameter is δ, which can be set manually and directly defines the δ in (ε, δ)-DP. This parameter δ also has the capability of influencing the value of ε (Mironov 2017, p. 5). Then there are two further parameters, σ and c, that affect the DP mechanism. In practice, the value of ε is determined mostly by σ, the noise multiplier, which defines how much noise is added by DP-SGD, as explained in section 2.2.4 on page 19. The other parameter, c, is the gradient clipping norm, which has been described in section 2.2.4 on page 19 as well. Throughout the experiments in the next section, it can be noticed that δ is kept constant at the advised value (the inverse of the size of the dataset, as explained in section 2.2 on page 14 (Dwork, Roth, et al. 2014, p. 18)), whereas different values of σ and c were experimented with. It was empirically found that σ was the most influential on the privacy and utility of the models that were studied.
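To make the role of these parameters concrete, the sketch below attaches a Privacy Engine to a plain SGD optimiser. It is a minimal illustration, assuming the PrivacyEngine interface of the PyTorch-DP / early Opacus releases; the argument names noise_multiplier (σ) and max_grad_norm (c), as well as the toy model, batch size and dataset size, are illustrative and may differ from the exact code used in this research.

import torch
from torch import nn
from torchdp import PrivacyEngine  # the library was later renamed to Opacus

# A small stand-in classifier; the architectures actually used are described later.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

delta = 1e-5   # δ: kept at roughly the inverse of the training-set size
sigma = 1.1    # σ: noise multiplier, the main driver of ε
clip = 1.0     # c: per-sample gradient clipping norm

privacy_engine = PrivacyEngine(
    model,
    batch_size=64,
    sample_size=60000,                             # number of training examples
    alphas=[1 + x / 10.0 for x in range(1, 100)],  # Rényi orders used for accounting
    noise_multiplier=sigma,
    max_grad_norm=clip,
)
privacy_engine.attach(optimizer)  # optimizer.step() now performs DP-SGD

# The RDP accountant can be converted to (ε, δ)-DP at any point during training:
epsilon, best_alpha = optimizer.privacy_engine.get_privacy_spent(delta)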

knockoff nets The terminology used in this section has been described in section 2.3.1 on page 21.

The code from Orekondy 2020 was used as a basis to work upon. It provides a foundation to perform model extraction as described in section 2.3.1 on page 21. It had to be modified to meet the requirements of this research. Six primary complications have been identified:

1. It does not implement differential privacy.

This was countered by training the victim models with DP separately. These DP models were then exported and imported into Knockoff Nets to train the attacker models.

2. Not all query sets that were used in this research were supported.

The code had to be edited to support the query sets that were used. For example, the FakeData query set, which will be used in section 4.1.2 on page 34, is not supported by Knockoff Nets. Therefore, the code needed modifications to enable the use of this query set.

3. Not all architectures used in this research were supported.

The code had to be modified to be compatible with the architectures that were used in this research.

4. The adaptive learning strategy has not been implemented in the released code. Implementing this is not straightforward within the current implementation of Knockoff Nets, because the query and training procedures are detached from each other, while adaptive learning relies on the connection between the two. Therefore, the code has been extended so that it enables the use of the adaptive learning approach from ActiveThief. The exact implementation details are explained in section 3.2 on the next page.

5. Testing the agreement between Knockoff models and original models is not supported.

This could be realised by letting the victim and attacker models predict outputs on the test set. Thereafter, these outputs can be compared using the agreement metric as defined in eq. (7) on page 30; a sketch of this comparison is given after this list.

6. Finally, the code contains some minor errors.

For instance, batch training is flawed, which caused erroneous initial results. This has been reported, though the modifications have not yet been made by Orekondy 2020.
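As a sketch of how point 5 could be realised, the snippet below compares the predicted classes of the victim and attacker models on a test loader; the function name agreement and its arguments are illustrative only and not taken from the Knockoff Nets code.

import torch

@torch.no_grad()
def agreement(victim, attacker, test_loader, device="cpu"):
    # Fraction of test samples on which the victim and attacker predict the same class.
    matches, total = 0, 0
    for images, _ in test_loader:  # ground-truth labels are not needed here
        images = images.to(device)
        victim_pred = victim(images).argmax(dim=1)
        attacker_pred = attacker(images).argmax(dim=1)
        matches += (victim_pred == attacker_pred).sum().item()
        total += images.size(0)
    return matches / total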
