
MSc Artificial Intelligence

Master Thesis

A Practical Approach to Differential Private Learning

by

K. L. van der Veen

10695079

July 6, 2018

36 ECTS January-June 2018

Supervisor:

Dr. P. Bloem

Assessor:

Dr. G. Patrini


Abstract

Applying differential private learning to real-world data is currently impractical. Differential privacy (DP) introduces extra hyper-parameters for which no thorough good practices exist, while manually tuning these hyper-parameters on private data results in low privacy guarantees. Furthermore, the exact guarantees provided by differential privacy for machine learning models are not well understood. Current approaches use undesirable post-hoc privacy attacks on models to assess privacy guarantees. To improve this situation, we introduce three tools to make DP machine learning more practical. First, two sanity checks for differential private learning are proposed. These sanity checks can be carried out in a centralized manner before training, do not involve training on the actual data and are easy to implement. Additionally, methods are proposed to reduce the effective number of tuneable privacy parameters by making use of an adaptive clipping bound. Lastly, existing methods regarding large batch training and differential private learning are combined. It is demonstrated that this combination improves model performance within a constant privacy budget.


Contents

1 Introduction

2 Private learning
  2.1 Privacy and machine learning
  2.2 Learned information in models
    2.2.1 Privacy attacks
    2.2.2 Memorization
  2.3 Differential Privacy
  2.4 Differential private learning
    2.4.1 Differential private SGD and the moments accountant
    2.4.2 User-level privacy
    2.4.3 Hyperparameter tuning for differential private SGD
    2.4.4 Alternative differential private learning
    2.4.5 Differential privacy as a defense mechanism to prevent learning private information

3 Experiments and Results
  3.1 Memorization with differential privacy
  3.2 Learning user-level patterns
  3.3 Adaptive clipping
  3.4 Large batch training

4 Conclusion

Bibliography

Appendix A
  A.1 Comparison DPSGD algorithm and DP-FedAvg algorithm

Appendix B
  B.1 Grid searches constant ℓ2 norm bounds
  B.2 Grid search α for adaptive clipping
    B.2.1 Concatenated clipping


Introduction

Training machine learning models on user data without violating privacy is a challenging problem. Besides learning patterns in the data, models can also learn private information about individuals. To prevent this, differential private machine learning was introduced [24]. We identify three problems with current approaches to differential private machine learning:

1. Privacy guarantees are difficult to interpret. A model trained with differential privacy produces statistical privacy values which are difficult to translate into practical guarantees. Therefore, the privacy guarantees of models are tested after training, which is undesirable, as privacy may already have been violated.

2. In non-private training, multiple models are often trained to find the optimal hyperparameters. This cannot be directly translated to private training, as privacy spending accumulates with every trained model. Currently, best practices for choosing the privacy parameters without training multiple models do not exist.

3. Earlier research treats hyperparameters and privacy parameters as independent [1, 18]. However, this is not always the case. For instance, the batch size is a parameter that largely influences privacy guarantees, but to preserve model performance, the learning rate must be changed accordingly.

Many earlier works either focus on preventing private information extraction while reaching acceptable performance, or on optimizing privacy guarantees while reaching non-private performance. Both approaches are different from what is desirable in practice: first determine how much privacy is needed for a given task and model. Then, within this privacy "budget", optimize the performance of the model.

With the rise of large data-centered businesses, data privacy is more relevant than ever. Data is owned by companies providing a free service, and in return, user data is stored and used by those companies. In this ecosystem, users lose control over their data. For instance, data may be shared between companies, used to train machine learning models or used for profiling. Additionally, data leaks are frequent and it is often hard for users to request or delete their own data. In Europe, the General Data Protection Regulation [11] (GDPR), enforceable from 25 May 2018, addresses many of these issues. With the GDPR, data gathering and storage becomes more involved and restricted for companies. For instance, data can only be collected for specified, explicit and legitimate purposes. Furthermore, data should be accessible and removable at any time. The GDPR is one of the first large-scale legal initiatives to protect the data privacy of individuals. However, the GDPR leaves important topics untouched: user data may be used to compute statistics or create models, which may contain sensitive information and will not be removed when users remove their data.

When data is used to train machine learning models, these kinds of additional privacy-related problems arise. While methods to make the training process itself private exist, as will be elaborated in the next chapters, the resulting model may have extracted sensitive information during training. Differential privacy (DP) is a technique designed to provide information theoretical privacy guarantees for the trained model. In essence, differential privacy computes approximate aggregate statistics on data while bounding the contribution of single data records. Several researchers successfully applied these ideas to machine learning [1, 18]. However, while information leakage may be bounded, differential privacy always leaks some information about individual examples. Therefore, a trade-off between model utility and privacy arises. Many works explored the boundaries of this trade-off by fine-tuning the privacy parameters to obtain an optimal trade-off [1, 18]. Others tried to extract sensitive information from trained models, resulting in minimal differential privacy levels to prevent this [6]. However, such privacy attacks are task specific and result in optimal privacy parameters for a particular attack and model. Additionally, for finding sufficient privacy parameters, the attacks need to experiment with the actual data, which introduces a data leak in itself. Lastly, the relation between model performance and privacy is not injective: many models with different performances can have the same DP guarantee.

In this work, we propose more general privacy sanity checks, applicable to a large range of classification problems and models, easy to implement and relatively fast to test. More importantly, the sanity checks do not involve any fine-tuning on the actual data. We do not claim that these sanity checks provide sufficient privacy guarantees for any specific attack. However, if the sanity checks fail with differential privacy, the privacy parameters should concern users. In addition, within the privacy guarantees needed to pass the sanity checks, best practices for choosing privacy parameters that optimize model performance are proposed.


Private learning

2.1 Privacy and machine learning

When sensitive data is used to train machine learning models, we often want to keep two things private: the data and the model. Machine learning models can be trained in a centralized or a distributed manner. When trained centrally, all data is stored on a single device with the model trained at the same location. In distributed machine learning, data is partitioned between different users that store data on their own device. Model updates are computed locally and combined to train the model. In centralized machine learning, the party training the model has access to the data, which might be undesirable from a privacy perspective. Whereas in decentralized machine learning, users have access to the model, which is also undesirable as models may be a very valuable asset for companies. Moreover, when locally training a model, the updates applied to the model may reveal information about users.

There exist techniques, such as Homomorphic Encryption (HE), Secure Multi-party Computation (MPC) and Secure Aggregation [5], that either partly or fully cover these privacy issues. Homomorphic Encryption allows for encrypted computation on an encrypted model or encrypted data, making their true content invisible to the party doing the computations while enabling correct computations. Alternatively, in MPC, multiple non-colluding parties can privately do a computation without seeing the individual components. For machine learning, this translates to computing model updates without having access to both the model and the data. Lastly, with secure aggregation, users can combine model updates such that the party training the model has access to the merged updates, but not to the individual updates. While both HE and MPC cover the privacy concerns of data/model access quite thoroughly, secure aggregation provides a weaker notion of privacy concerning the model updates. Aggregated gradients reveal less information about user data than individual updates, but there is no quantifiable guarantee on how much information can be leaked. Therefore, the resulting models may contain sensitive information.

In order to prevent parties training a model from having direct access to data, and users from having direct access to intermediate models, HE or MPC techniques can be utilized. However, these privacy measures come with a cost. Roughly, MPC techniques introduce communication overhead between computations and HE techniques induce computation overhead. SecureML [19], an example of MPC, uses two-party computations to privately train logistic regression models and neural networks. An example of HE is CryptoNets [7], in which users homomorphically encrypt their data and send it to a cloud service which trains a model by performing computations on the encrypted data.


When doing distributed machine learning naively, each user sends their model update to a centralized service, which merges the updates into a single update for a batch of users. These updates contain information about users, and should therefore not be communicated in the clear. Google's Federated Learning algorithm [15] introduces techniques for fast and secure aggregation of gradients. The focus of their work is mostly on optimizing the communication efficiency of the aggregation process and making the protocol robust against adversaries and disconnecting users. While this work is a very relevant contribution to private learning, it lacks any guarantees about the amount of user information leaked by the aggregated gradients during training.

2.2 Learned information in models

2.2.1 Privacy attacks

The intuition that private information is leaked into models during training becomes evident when attacks on models are considered. These privacy attacks aim to extract information from trained models. Fredrikson et al. [12] showed that recognizable images of people's faces can be extracted given only their name and access to an MLP face-classification model. Moreover, in work by Shokri et al. [25] it was demonstrated that with only black-box access to models, i.e. only having access to inputs and outputs of the model, it was possible to determine whether a record was part of the training set. In work by Carlini et al. [6], inserted credit-card and social security numbers were extracted with, again, merely black-box access to a next-token-prediction model. Lastly, Zhang et al. [28] exhibited that large neural networks easily fit a random labeling of the data. This result shows that many neural networks have the capacity to memorize entire training sets. Although this is not a privacy attack in itself, it does raise privacy concerns about what is memorized by a model in real-world tasks.

The previously discussed privacy attacks are based on extracting information from a model after training. Contrarily, Hitaj et al. [14] introduced an attack in which a Generative Adversarial Network (GAN) was trained to produce images of participants while actively interfering in the training process of an image classification model on faces. In their paper, an adversary uses intermediate models during training as a discriminator for a GAN to train a generator of realistic images of participants. Additionally, the adversary influences the training process by training the intermediate model on the generated images coupled with fake labels, forcing the model to reveal more information when trained on the other participants.

2.2.2 Memorization

In a popular recent paper from Zhang et al. [28] it was shown that large deep learning models can achieve 100% training accuracy when training on random labelings of the data or training on random noise. From a machine learning perspective this is an interesting result, as it shows that models that generalize well on real data have the capacity to memorize the full training set, challenging conventional ideas regarding generalization. From a privacy perspective this is also an interesting result. In particular, it shows that large models have large memorization capabilities and for some tasks tend to memorize during training.


Memorization can be considered as an extreme form of overfitting. In the machine learning community, regularization techniques are widely adopted to prevent overfitting on the training data. Surprisingly, Zhang et al.'s [28] work demonstrated that popular regularization techniques do not protect against memorization. Arpit et al. [4] built on this work and examined the role of memorization in learning more extensively. They found that real data contains "easy examples", which have a higher chance of being classified correctly and can be learned from some simple patterns. For random noise data there are no easy examples, and to learn anything, models are forced to memorize. Aside from being easier to learn, it was shown that simple patterns are learned before memorization occurs. From a privacy perspective, this is an encouraging fact, as private machine learning models should learn patterns without memorization. Additionally, the memorization of individual examples was examined by measuring the loss sensitivity of each example. For real data, the loss sensitivity is only high for a small portion of the full training set, whereas for learning noise, the loss sensitivity is high for all examples. From these experiments it was concluded that on real data, networks do not memorize to the same extent as on random noise. Lastly, the role of regularization was re-examined. It was concluded that even though regularization is often ineffective against memorization, it can limit the speed of memorization.

2.3 Differential Privacy

Differential privacy [8] is an important mathematical framework which bounds the contribution of individual entries to some statistic over a database. Intuitively, differential privacy can be pictured as two worlds with the same dataset, except that in one world a single example is missing. If some statistic is measured in both worlds, the two outcomes should be almost indistinguishable.

Mathematically, differential privacy is defined [9] over all possible transcripts t and databases D^n. We can think of a transcript as corresponding to a single query function and response. Databases are modeled as a vector of n entries from some domain D. Given an adversary A and a particular database x, transcripts are denoted by the random variable T_A(x). Differential privacy considers all pairs of adjacent datasets, i.e. those that differ in just one entry:

Theorem 1. A mechanism is ε-indistinguishable if for all pairs x, x′ ∈ D^n which differ in only one entry, for all adversaries A, and for all transcripts t:

Pr[T_A(x) = t] / Pr[T_A(x′) = t] ≤ e^ε

Here, ε is called the leakage, the acceptable amount of information extracted from a single record. Differential privacy achieves this guarantee by bounding the sensitivity of the learned statistic and adding noise scaled by this sensitivity [9].

Theorem 2. The L1 sensitivity of a function f : D^n → R^d is the smallest number S(f) such that for all x, x′ ∈ D^n which differ in a single entry,

‖f(x) − f(x′)‖_1 ≤ S(f)
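To make the relation between sensitivity and noise concrete, the following minimal sketch (an illustration, not part of the thesis) applies the Laplace mechanism of [9] to a counting query; the function name and example query are assumptions made for this sketch.

import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    # The Laplace mechanism adds noise with scale S(f)/epsilon, which yields
    # epsilon-differential privacy for a query with L1 sensitivity S(f) [9].
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# Example: a counting query ("how many records satisfy a predicate?") has
# L1 sensitivity 1, since adding or removing one record changes the count
# by at most 1.
true_count = 42
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)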

This form of differential privacy is called ε-differential privacy. When differential privacy is applied to machine learning, this may translate into bounding the sensitivity of the average gradient update over a batch of users and adding noise accordingly. However, in machine learning, a more popular choice is (ε, δ)-differential privacy, which is defined [18] as:

Theorem 3. A randomized mechanism M : D → R with domain D (e.g., possible training datasets) and range R (e.g., all possible trained models) satisfies (ε, δ)-differential privacy if for any two adjacent datasets d, d′ ∈ D and for any subset of outputs S ⊆ R it holds that:

Pr[M(d) ∈ S] ≤ e^ε Pr[M(d′) ∈ S] + δ

Here, a new parameter δ is introduced: the absolute value of the privacy loss will be bounded by ε with probability at least 1 − δ. Therefore, (ε, 0)-differential privacy corresponds to ε-differential privacy [10]. In practice, the chosen δ value should be at most 1/N, where N is the number of records in the dataset. For instance, McMahan et al. [18] set δ = N^{-1.1}.
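As a further illustration (not taken from the thesis), the Gaussian mechanism is a common way to satisfy (ε, δ)-differential privacy; the sketch below assumes the standard calibration σ ≥ sqrt(2 ln(1.25/δ)) · S₂(f)/ε for ε < 1 from Dwork and Roth [10], with names and the example statistic chosen for illustration.

import numpy as np

def gaussian_mechanism(true_value, l2_sensitivity, epsilon, delta):
    # Standard calibration for (epsilon, delta)-DP with epsilon < 1 [10]:
    # sigma >= sqrt(2 * ln(1.25 / delta)) * S_2(f) / epsilon
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
    return true_value + np.random.normal(0.0, sigma, size=np.shape(true_value))

# Example: releasing an averaged gradient whose per-example L2 norm is clipped
# to C has L2 sensitivity C / n for a batch of n examples.
clipped_mean = np.zeros(10)  # placeholder statistic
noisy_mean = gaussian_mechanism(clipped_mean, l2_sensitivity=1.0 / 128,
                                epsilon=0.5, delta=1e-5)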

A useful feature of differential privacy is that it provides methods to compute the privacy guarantees induced by repeated use of the same dataset. The basic composition theorem [10] states that a k-fold composition of an (ε, δ)-DP algorithm provides (kε, kδ)-DP; that is, the joint output of k calls to an (ε, δ)-DP algorithm is (kε, kδ)-DP. A more refined composition result for repeated use of the same dataset is the Advanced composition theorem [10]. This theorem composes multiple (ε, δ)-DP calls into a slightly larger δ_total but a smaller ε_total than simply summing all individual ε's.

Theorem 4. (Advanced composition) For all ε, δ, δ′ ≥ 0, the class of (ε, δ)-differentially private mechanisms satisfies (ε′, kδ + δ′)-differential privacy under k-fold adaptive composition for:

ε′ = √(2k ln(1/δ′)) · ε + k · ε · (e^ε − 1)
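A short numerical sketch (illustrative only) comparing the basic and advanced composition bounds for many repeated calls:

import numpy as np

def basic_composition(eps, delta, k):
    # k-fold composition of an (eps, delta)-DP algorithm is (k*eps, k*delta)-DP [10].
    return k * eps, k * delta

def advanced_composition(eps, delta, k, delta_prime):
    # Theorem 4: (eps', k*delta + delta')-DP under k-fold adaptive composition [10].
    eps_prime = np.sqrt(2 * k * np.log(1 / delta_prime)) * eps \
                + k * eps * (np.exp(eps) - 1)
    return eps_prime, k * delta + delta_prime

# Example: 1,000 calls with eps = 0.01 and delta = 1e-7.
print(basic_composition(0.01, 1e-7, 1000))            # eps_total = 10.0
print(advanced_composition(0.01, 1e-7, 1000, 1e-5))   # eps_total ≈ 1.6, far below 10.0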

Additionally, when a subset of the data is randomly sampled from the dataset with a uniform sampling probability larger than δ to compute a differential private statistic, the Privacy amplification theorem [16] can be exploited. This theorem states the following:

Theorem 5. (Privacy amplification) An (ε, δ)-DP algorithm applied to a randomly sampled subset of the data with uniform sampling probability q > δ is (min(ε, log(1 + q(e^ε − 1))), qδ)-DP.

2.4 Differential private learning

Differential privacy was originally designed as a mechanism for privacy-preserving statistical queries. This section describes how differential privacy can be successfully applied to machine learning and the tools that are used for this in practice.

2.4.1 Differential private SGD and the moments accountant

One of the early well-known approaches to achieve differential private machine learning is from Shokri et al. [24]. In their work, MLPs are trained in a distributed manner by updating a selection of the local gradients and adding noise to them within a privacy budget per parameter. Building on this work, Abadi et al. [1] introduced a simpler differential private SGD (DPSGD) algorithm (Algorithm 1) that ensures differential privacy by clipping gradients to a maximum ℓ2 norm for each layer and subsequently adding noise scaled by the ℓ2 norm clipping bound. It was demonstrated that with the DPSGD algorithm, high quality models can be trained with privacy under a modest privacy budget. Additionally, the moments accountant was introduced, combining the ideas of the Privacy amplification theorem (5) and the Advanced composition theorem (4). The moments accountant enables automated analysis of the privacy loss while providing much tighter bounds than both individual methods. Combined, the DPSGD algorithm and the moments accountant introduce a few new parameters, which are used to compute the noised gradients and subsequently accumulate the privacy loss. The σ parameter corresponds to the noise scale applied to the gradient updates. Secondly, a clipping bound C is introduced, corresponding to the bound on the ℓ2 norm of the gradients. The last parameter is the lot size L, determining the size of the group of gradients for which noise is added. The original DPSGD algorithm [1] is described in Algorithm 1.

From a high level, the moments accountant has two functions: accumulating the privacy spending for each released vector of gradients and computing the intermediate privacy spending. To accumulate the privacy spending, the moments accountant uses the noise scale σ and assumes that the DPSGD algorithm correctly noises and clips the gradients. As the DPSGD algorithm clips and noises each layer separately, multiple calls to the moments accountant are made during one iteration. To compute the intermediate privacy spending during training, the moments accountant requires an ε or δ value and computes its counterpart. The most common approach is to fix δ and record ε over the course of training.


Algorithm 1 Differential private SGD (Outline)

Input: Examples {x_1, ..., x_N}, loss function L(θ) = (1/N) Σ_i L(θ, x_i). Parameters: learning rate η_t, noise scale σ, group size L, gradient norm bound C.

Initialize θ_0 randomly
for t ∈ [T] do
    Take a random sample L_t, sampling each example with probability L/N
    Compute gradient: for each i ∈ L_t, compute g_t(x_i) ← ∇_θ_t L(θ_t, x_i)
    Clip gradient: ḡ_t(x_i) ← g_t(x_i) / max(1, ‖g_t(x_i)‖_2 / C)
    Add noise: g̃_t ← (1/L) (Σ_i ḡ_t(x_i) + N(0, σ²C²I))
    Descent: θ_{t+1} ← θ_t − η_t g̃_t
Output θ_T and compute the overall privacy cost (ε, δ) using a privacy accounting method

The TensorFlow implementation of the DPSGD algorithm accompanying the paper [3] has a few differences compared with Algorithm 1. In non-private SGD, the mean cross-entropy loss is optimized with respect to the whole batch. However, the DPSGD algorithm clips individual gradients to bound the sensitivity of the averaged gradient. Therefore, it requires per-example gradient computation. Consequently, as we still compute gradients with respect to the mean loss, the individual gradients are scaled down by a factor equal to the batch size. To compensate for this, the clipping bound C is also scaled down by a factor equal to the batch size, and there is no division by L when noise is added. Note that this still correctly computes the privacy loss, as long as gradients are clipped and noise is scaled with the same value.
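A minimal NumPy sketch of the clip-and-noise stage of Algorithm 1 is given below, assuming per-example gradients have already been computed and flattened; the function name and shapes are illustrative, not the authors' implementation.

import numpy as np

def dpsgd_noisy_update(per_example_grads, clip_bound_C, noise_scale_sigma):
    # per_example_grads: array of shape (L, d), one flattened gradient per example in the lot.
    L = per_example_grads.shape[0]
    # Clip each example's gradient to an L2 norm of at most C (Algorithm 1).
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_bound_C)
    # Sum, add Gaussian noise with standard deviation sigma * C, and average over the lot.
    noise = np.random.normal(0.0, noise_scale_sigma * clip_bound_C, size=clipped.shape[1])
    return (clipped.sum(axis=0) + noise) / L

# Example usage with random per-example gradients for a lot of 128 examples,
# using the sigma and S values from the experiments in Chapter 3.
grads = np.random.randn(128, 1000)
update = dpsgd_noisy_update(grads, clip_bound_C=2.0, noise_scale_sigma=0.7225)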

2.4.2 User-level privacy

Until this point, all of the considered approaches used record-level differential privacy as a framework to protect private information. For instance, the DPSGD algorithm bounds the contribution of individual records to some statistic. In many real-world settings, users have multiple sources of data, which may be correlated and should therefore be protected as a whole. Using the moments accountant, McMahan et al. [18] introduced a user-level differential private algorithm called the DP-FedAvg algorithm (Algorithm 2), protecting all the data of one user. Instead of bounding the contribution of a single record, the DP-FedAvg algorithm bounds the contribution of a user's dataset to the learned model. In this algorithm, the DPSGD algorithm from Abadi et al. [1] is combined with the FederatedAveraging algorithm from McMahan et al. [17], which couples local stochastic gradient descent on each client with a server that performs model averaging. To ensure user-level differential privacy, the adjacent datasets from Theorem 1 can be defined as user-adjacent datasets [18]:

Theorem 6. (User-adjacent datasets) Let d and d′ be two datasets of training examples, where each example is associated with a user. Then, d and d′ are adjacent if d′ can be formed by adding or removing all of the examples associated with a single user from d.


Compared to the DPSGD algorithm, the DP-FedAvg algorithm introduces a few differences, mainly due to the user setting. Firstly, the clipping bound C from DPSGD is replaced by S in the DP-FedAvg algorithm. Secondly, instead of weighting each record equally, each user is weighted according to the size of their local dataset. Moreover, user weight updates are allowed to originate from multiple local updates, each clipped independently with respect to the global model state θ_0. Furthermore, the summed weight update is scaled down by the expected number of users, instead of the number of records. Lastly, instead of composing the standard deviation of the Gaussian noise into a noise scale σ and a clipping bound C, the DP-FedAvg algorithm defines the standard deviation as the noise scale z times the clipping bound S divided by the expected number of users qW. Therefore, the parameter σ in the DPSGD algorithm corresponds to the parameter z in the DP-FedAvg algorithm. While there are a few differences, from a high level the two algorithms are also very similar. Both methods compute a number of gradients, clip each gradient to bound its sensitivity, and add noise scaled by this sensitivity. For a more detailed comparison between the two algorithms and how they compute their noised gradients we refer to Appendix A.1.

Algorithm 2 The main loop for DP-FedAvg. The calls on the moments accountant M refer to the API of Abadi et al. [1]

Main training loop:
  parameters: user selection probability q ∈ (0, 1], per-user example cap ŵ ∈ R+, noise scale z ∈ R+, estimator f̂, UserUpdate (for FedAvg), ClipFn (FlatClip or PerLayerClip)
  Init model θ^0, moments accountant M
  w_k = min(n_k / ŵ, 1) for all users k
  W = Σ_{k ∈ d} w_k
  for each round t = 0, 1, 2, ... do
      C^t ← (sample users with probability q)
      for each user k ∈ C^t in parallel do
          Δ_k^{t+1} ← UserUpdate(k, θ^t, ClipFn)
      Δ^{t+1} = (Σ_{k ∈ C^t} w_k Δ_k) / (qW)
      S ← (bound on ‖Δ_k‖ for ClipFn)
      σ ← zS / (qW)
      θ^{t+1} ← θ^t + Δ^{t+1} + N(0, Iσ²)
      M.accum_privacy_spending(z)
  M.get_privacy_spent()

FlatClip(Δ): parameter S
  return π(Δ, S)  // See Eq. (2.1)

PerLayerClip(Δ): parameters S_1, ..., S_m
  S = √(Σ_j S_j²)
  for each layer j ∈ 1, ..., m do
      Δ′(j) = π(Δ(j), S_j)
  return Δ′

UserUpdateFedAvg(k, θ^0, ClipFn): parameters B, E, η
  θ ← θ^0
  for each local epoch i from 1 to E do
      B ← (k's data split into size-B batches)
      for batch b ∈ B do
          θ ← θ − η∇ℓ(θ; b)
      θ ← θ^0 + ClipFn(θ − θ^0)
  return update Δ_k = θ − θ^0  // already clipped

(13)

The DP-FedAvg algorithm offers two options for clipping. Both make use of the ℓ2-projection π, defined as:

π(Δ, S) := Δ · min(1, S / ‖Δ‖).   (2.1)

The PerLayerClip function is similar to the clipping function used in the DPSGD algorithm but differs in the ability to choose a different clipping bound for each layer. Choosing different clipping bounds per layer might be beneficial, as the size of the gradients may vary. The FlatClip function is less flexible and clips all layers at once by clipping the concatenation of the gradient matrices. As the FlatClip function only clips once, the moments accountant is also called once. Intuitively, since the ℓ2 norm is larger for the concatenation of all layers, a larger S must be used, resulting in a larger σ. However, as the moments accountant is only called once, the z parameter can be lowered while reaching the same privacy spending per iteration.
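A small sketch (illustrative, not the authors' code) of the ℓ2-projection of Eq. (2.1) and the two clipping options:

import numpy as np

def l2_project(delta, S):
    # Eq. (2.1): scale delta down so that its L2 norm is at most S.
    norm = np.linalg.norm(delta)
    return delta * min(1.0, S / norm) if norm > 0 else delta

def flat_clip(layer_deltas, S):
    # FlatClip: clip the concatenation of all layers at once with a single bound S.
    flat = np.concatenate([d.ravel() for d in layer_deltas])
    clipped = l2_project(flat, S)
    # Restore the original per-layer shapes.
    out, i = [], 0
    for d in layer_deltas:
        out.append(clipped[i:i + d.size].reshape(d.shape))
        i += d.size
    return out

def per_layer_clip(layer_deltas, S_per_layer):
    # PerLayerClip: clip each layer independently with its own bound S_j.
    return [l2_project(d, S_j) for d, S_j in zip(layer_deltas, S_per_layer)]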

2.4.3 Hyperparameter tuning for differential private SGD

The DPSGD algorithm introduces three new parameters when compared to non-private learning: the noise scale σ, the clipping bound S and the lot size L. To enable a good trade-off between privacy and utility, all parameters should be carefully chosen. The clipping bound parameter does not directly influence privacy. However, choosing inappropriate values may destroy utility. With a low clipping bound, many gradients are larger than the bound. This results in a lot of scaling, which can make large and small gradients indistinguishable. A high clipping bound adds a lot of noise, as the clipping bound determines the sensitivity of the update, by which the noise is scaled. The most popular approach is to treat S as a hyperparameter. McMahan et al. [18] examined the difference between clipping per layer and clipping the concatenation of all layers in the DP-FedAvg algorithm for a recurrent language model, but did not find significant differences in terms of performance. In their work, they pointed to weight-aware clipping schemes as a possible direction for future work to optimize performance. The second parameter is the lot size, which does influence the privacy spending: roughly speaking, when computing a statistic over more examples, more noise should be added. However, for a typically sized machine learning dataset and a fixed number of epochs, the noise scale required to reach the same ε does not grow linearly with the batch size. Consequently, for larger batches less noise per example is required to reach the same privacy guarantee. Therefore, intuitively, differential private models should benefit from large batch training. However, simple grid searches over individual hyperparameters by Abadi et al. [1] demonstrated that larger batches do not always result in better performance. In non-private learning, a large amount of research has been devoted to learning with large batches. Large-batch training is non-trivial and introduces new hyperparameter dynamics. Specifically, work by Goyal et al. [13] provides an explanation why large batches were considered non-optimal by Abadi et al.: it is recommended to linearly scale the learning rate when increasing the batch size to reach the same accuracy. Lastly, the σ parameter most directly reflects the trade-off between privacy and utility. A higher σ value results in more privacy, but also degrades the performance of the model.


2.4.4 Alternative differential private learning

Other approaches to enable differential private learning can be characterized by two phases. In the first phase, a differential private model or artificial dataset is created, which is used in the second phase to produce the actual classification model. These approaches are often appealing, as once a private dataset or data generating model is created, the classification model can be trained with a privacy budget that is independent of the number of training epochs. Papernot et al. [20, 21] introduced a method called Private Aggregation of Teacher Ensembles (PATE). In their work, an ensemble of teacher models is trained on small subsets of the data with a very small privacy budget. Subsequently, these models are used to label a public unlabeled dataset which is used for training the actual classification model. The privacy budget for this approach is much lower than for traditional DP machine learning approaches. However, this approach relies on a public unlabeled dataset, which may not be available in many practical scenarios. Similar to this approach, Phan et al. [22] introduced a method called the "Adaptive Laplace Mechanism", creating a private affine transformation of the input where less relevant features are noised more and vice versa. This transformation layer is subsequently used to train a neural network with high accuracy.

2.4.5 Differential privacy as a defense mechanism to prevent learning private information

Some earlier research concludes that differential privacy is a solution for the privacy problem at hand [6, 23, 27]. However, there are examples [14] that claim differential privacy is not effective against the proposed attack. From these works various conclusions can be drawn. Firstly, the granularity of DP should always be carefully set. When a model is learned on a dataset that is distributed over users, record-level DP does not protect against correlations that can be found within one user. Secondly, given that the granularity of DP is correct, privacy can always be achieved by ensuring an extremely low ε budget. Consequently, the utility of the learned statistics is destroyed by the large amount of noise and nothing is learned. Differential private learning must always ensure privacy. Whether an effective model can still be learned within the resulting bounds depends on how correlations are distributed in the dataset. Therefore, DP is only useful for problems where an acceptable trade-off between privacy and utility exists. For problems where such a trade-off does not exist, differential privacy should destroy the performance of the model to achieve privacy.


Experiments and Results

To enable practical application of differential private learning, guidelines for choosing privacy parameters that result in a satisfying performance-privacy trade-off are required. Additionally, correctly following these guidelines should result in a model that is robust against privacy attacks. In earlier work, the privacy guarantees of differential private models are evaluated by exposing models to privacy attacks after training. Such a post-hoc approach is unsuitable for practical applications, as it requires a possibly non-private model to be trained on sensitive information to find optimal DP parameters. Ideally, to examine the privacy guarantees of differential private training pipelines in a practical setup, sanity checks should be performed before training, without learning on the actual data. Such sanity checks should be repeatedly applied until parameters are found that provide satisfying privacy guarantees. Furthermore, hyperparameters that influence model performance require more careful attention in a private setting. When multiple differential private models are trained to tune hyperparameters, we have to subtract the accumulated privacy spending from our privacy budget. For non-private deep learning, there exist good practices for hyper-parameters to reach optimal performance. However, as differential private deep learning is relatively new, such good practices barely exist for DP-related hyper-parameters. Therefore, as training multiple models is undesirable, hyperparameters should either be set adaptively during training or heuristics for choosing them must be invented.

In the next sections, models are trained on CIFAR-10 to test their performance. Additionally, models are trained on 50,000 28x28x3 random noise images, similar to Zhang et al. [28], to test memorization capabilities. For CIFAR-10, the preprocessing is performed identically to earlier work [28]: pixel values are divided by 255 and center 28x28 patches are cropped from the image. Subsequently, every image is normalized by subtracting the mean and dividing by the adjusted standard deviation independently for each image, using the per-image whitening function in TensorFlow [2, 28]. The second experiment consists of user-level privacy checks, for which an artificial dataset is used, consisting of random noise with inserted patterns for some users; this is discussed in more detail in Section 3.2. Two well-known model architectures are used in the experiments. The first model is a two-layer MLP, with a hidden layer of 512 nodes and ReLU activation functions. Secondly, a small version of the Alexnet model [28] is used. Unless stated otherwise, both models are trained with SGD with a base learning rate of 0.01 and no regularization. For the first and the last two experiments, the moments accountant from Abadi et al. [1] is used, while the second experiment uses an adapted version of the DP-FedAvg algorithm from McMahan et al. [18]. Both differential private algorithms use the moments accountant to track ε for δ = N^{-1.1}, where N is the number of data points in the training set.


3.1 Memorization with differential privacy

In this section, a first sanity check for differential private learning is proposed: train a model on random noise and assess the level of memorization. In differential private training, memorization should not be possible as it requires learning record-specific features. If memorization is possible, in theory the same model trained on real data could also memorize privacy sensitive information. To produce strong sanity checks, privacy guarantees of models should be verified in an environment where privacy violations are easily made. Privacy preserving methods should also work in worst case scenarios, as we cannot make assumptions about the nature of the data. Then, when privacy is not violated in the sanity check, we have some confidence that privacy will not be violated on real data. Zhang et al. [28] showed that models trained on random noise converge faster than models trained on fake labels, which is therefore the more favorable option in this setting. Furthermore, Arpit et al. [4] concluded that networks tend to memorize more on random noise compared to real data. However, we stress that prevention of memorization is a necessary condition for privacy, not a sufficient condition.
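As an illustration of this first sanity check, the following sketch builds a random-noise dataset with the same dimensions as CIFAR-10 and random labels; the function name and shapes are assumptions made for this sketch.

import numpy as np

def make_noise_dataset(n_examples=50000, shape=(28, 28, 3), n_classes=10, seed=0):
    # Random noise inputs with zero mean and unit variance, and random labels:
    # any fit on this data can only come from memorization, not from shared patterns.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_examples,) + shape).astype(np.float32)
    y = rng.integers(0, n_classes, size=n_examples)
    return x, y

x_noise, y_noise = make_noise_dataset()
# Train the DP model on (x_noise, y_noise) and check that the training accuracy
# stays near chance level (10% for 10 classes) for the chosen privacy parameters.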

We train the small Alexnet model for 60 epochs on a dataset of 50,000 random noise examples with zero mean and unit variance, with the same dimensions as the input of the actual task, CIFAR-10. The model is optimized with momentum with a momentum coefficient of 0.9, similar to the experiments in Zhang et al. [28]. To do this, the original TensorFlow implementation of the DPSGD algorithm was extended to use the TensorFlow Momentum optimizer. The model is trained twice, once with the DPSGD algorithm with momentum and once with non-private SGD with momentum. In the first experiment the models are trained on random noise and the training accuracy is reported. Subsequently, the same models are trained on CIFAR-10. Here, the test accuracy and the difference between train and test accuracy are reported for both models. The differential private models use σ = 0.7225 for each layer and a batch size of 128, resulting in ε = 20 for δ = N^{-1.1} with N = 50,000 after 60 epochs. Prior to this experiment a grid search (see Appendix B.1) over S was performed to find the value that results in the best test accuracy for this setup. Layers are clipped independently using S = 2.0. The results are reported in Figure 3.1.

Figure 3.1: Sanity check — (a) train accuracy (memorization) on random noise, (b) test accuracy on CIFAR-10, and (c) generalization gap on CIFAR-10.

It can be concluded that even with a large privacy spending, differential privacy effectively prevents memorization of random noise. The non-private model learns the random noise with a final training accuracy of 100%, whereas the private model reaches a training accuracy very close to random: 12.5%. Simultaneously, DP models are capable of learning on real data. However, it should be noted that this privacy comes with a cost: the performance of training on actual data drops from 75.7% to 63.8% when using default parameters. Lastly, when training differential private models on real data, the generalization gap is smaller compared to non-private training, indicating that differential private models generalize better while memorizing less, compared to non-private models.

3.2 Learning user-level patterns

In real-world settings, data is often partitioned between users, requiring user-level differential private learning. This section focuses on learning in a user setting with the DP-FedAvg algorithm. While hyperparameter optimization methods can be easily transferred from the DPSGD algorithm to the DP-FedAvg algorithm, the sanity check for memorization does not suffice in a user setting. Therefore, we introduce a sanity check for learning user patterns. With the right user-level differential privacy settings, extracting correlations that merely exist within the data of one user should be prevented, while learning correlations that exist among multiple users should still be possible. The previous sanity check examines the ability to learn patterns from one record. Similarly, this sanity check examines the ability to learn patterns from the data of a single user.

For this sanity check, we create two artificial datasets. Both datasets contain 1,000 users with 10 records each, resulting in 10,000 records in total. All records contain 28x28 random noise images with zero mean and unit variance, and one random label out of 10 options. For N_p of the 10,000 records, a pattern is inserted: in the 14x14 upper-left patch, all pixel values are set to 1.0 and the label is set to label 1. As a result, there are no correlations between the input and the output in this dataset, except for the records with the pattern inserted. With this sanity check, we aim to examine to what extent a pattern is learned when the pattern only exists in the data of one of the users.
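A sketch of how such artificial user datasets with inserted patterns could be constructed (the function and its partitioning arguments are illustrative, not the thesis code):

import numpy as np

def make_user_pattern_dataset(n_users=1000, records_per_user=10, n_pattern=10,
                              pattern_owner_users=1, seed=0):
    # Random-noise records grouped per user; N_p records get a 14x14 patch of 1.0
    # in the upper-left corner and label 1, spread over `pattern_owner_users` users.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_users, records_per_user, 28, 28)).astype(np.float32)
    y = rng.integers(0, 10, size=(n_users, records_per_user))
    per_user = n_pattern // pattern_owner_users
    for u in range(pattern_owner_users):
        for r in range(per_user):
            x[u, r, :14, :14] = 1.0
            y[u, r] = 1
    return x, y

# Centralized variant: all 10 pattern records belong to a single user.
x_central, y_central = make_user_pattern_dataset(n_pattern=10, pattern_owner_users=1)
# Distributed variant: 100 pattern records spread over 100 users, one record each.
x_dist, y_dist = make_user_pattern_dataset(n_pattern=100, pattern_owner_users=100)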

To perform this experiment, the DP-FedAvg algorithm was implemented and used to simulate a distributed environment where a two-layer MLP is trained on the users of the artificial datasets. Instead of computing all user updates in parallel in a distributed setting, the updates are computed in series on a single machine. The private models are trained with noise scale z = 10.3, resulting in ε = 1.1 after 10 epochs, with a per-user example cap ŵ = 10 and a user selection probability q = 0.2. In the UserUpdateFedAvg function, users train on their dataset for one epoch with a local batch size B = 10 and a learning rate η = 0.2. Subsequently, each layer is clipped. The ℓ2 norm bound is optimized by choosing the median over the course of training for each layer, as suggested by Abadi et al. [1]; the resulting bounds are reported in appendix B.3. The non-private model is trained in the same setting, but without clipping and noising.


The first of our two datasets contains the sanity check and will be referred to as the centralized dataset from now on. For this dataset, the records containing the pattern are owned by one user. This results in 999 users without the pattern, and 1 user with the pattern inserted in all 10 records owned by that user. We also created an artificial test set, consisting of 100 users with 10 records each, with the pattern inserted in all of their records. This test set examines to what extent the inserted pattern in the training set is learned. To examine whether the model can learn the pattern from a single user, we train a model with and without user-level differential privacy on this dataset. We report the accuracy on the test set in Figure 3.2 (left).

Figure 3.2: Training MLPs on the centralized dataset (left) and on the distributed dataset with N_p = 100 (right).

In the centralized setting, we are able to learn the pattern without differential privacy but not with differential privacy. This is exactly the desired behaviour, and we can conclude that user-level differential privacy effectively prevents learning correlations that merely exist in the data of individuals. However, we stress again that this sanity check is a necessary condition for privacy, not a sufficient condition.

However, for differential private learning to be practical, we must be able to learn the pattern when it is distributed over more users. It turns out we can, but we need to increase the total number of patterns N_p. To examine this, a second dataset is created, from now on referred to as the distributed dataset. For this dataset, the records containing the patterns are partitioned over 100 different users. This results in 900 users without the pattern, and 100 users with the pattern in one of their 10 data records. We use the same test set and training procedure and report the results in Figure 3.2 (right).

Note that there is a difference between the two datasets: for the centralized dataset, N_p = 10, while for the distributed dataset, N_p = 100. Therefore we cannot directly compare the distributed and the centralized setting. However, this experiment is not meant as a fair comparison between learning distributed and centralized patterns. With differential privacy we need more correlations to successfully learn a pattern. But at the same time, it is necessary to only allow learning from many examples to ensure privacy.

In the distributed setting we can learn the pattern both with and without differential privacy, given that the pattern exists among enough users. This may put constraints on the range of datasets that can be used for differential private learning. However, it is a direct consequence of privacy.

3.3 Adaptive clipping

One of the main challenges for training a model with the DPSGD algorithm is choosing a good clipping parameter S. Clipping does not directly influence the DP guarantees but may destroy model performance when not carefully chosen. To get an intuition why choosing a single good clipping bound is a difficult problem, the ℓ2 norm of the gradient per layer is plotted over the course of training for a non-private model in Figure 3.3. Two observations can be made: 1) the ℓ2 norms of layers may be very different at the beginning of training compared to the end of training. This is in line with conclusions by McMahan et al. [18], who found that the size of gradients may vary over the course of training and therefore a single clipping bound may be suboptimal. As a result, they pointed in the direction of weight-aware clipping schemes. 2) The size of the gradient may be different for each layer, especially when comparing weights and biases. Hence, we expect that a good clipping bound should change over the course of training and have a different value for each layer. Manually choosing such values would require a lot of fine-tuning and thereby quickly spend the privacy budget. In their work, Abadi et al. [1] suggested that a good rule of thumb for choosing the ℓ2 norm bound is to use the median ℓ2 norm over the course of training. Combining this with the suggestion for weight-aware clipping schemes, we propose a gradient-aware clipping scheme. This adaptive clipping schedule uses the mean ℓ2 norm of the previous batch times a constant factor α as the ℓ2 norm bound for the current batch L. We define the per-layer clipping bound C_t^l for round t and layer l over the individual gradients g_{t-1}^l(x_i) from that layer of the previous round t − 1:

C_t^l = α · (1/|L|) Σ_{i ∈ L_{t−1}} ‖g_{t−1}^l(x_i)‖_2    (3.1)
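A sketch of the adaptive clipping bound of Eq. (3.1), computed from the per-layer ℓ2 norms of the previous batch (illustrative, not the thesis implementation):

import numpy as np

def adaptive_clip_bounds(prev_per_example_layer_norms, alpha=1.0):
    # prev_per_example_layer_norms: array of shape (L, n_layers) holding
    # ||g_{t-1}^l(x_i)||_2 for every example i in the previous batch and layer l.
    # Eq. (3.1): C_t^l = alpha * mean of the previous batch's per-layer norms.
    return alpha * prev_per_example_layer_norms.mean(axis=0)

# Example: previous batch of 128 examples, 4 layers.
prev_norms = np.abs(np.random.randn(128, 4)) + 0.5
C_t = adaptive_clip_bounds(prev_norms, alpha=1.1)   # one clipping bound per layer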

To compare the proposed clipping scheme, two versions of the small Alexnet model are trained with the DPSGD algorithm, this time without momentum. The first model uses a constant clipping bound, optimized by a grid search over S (Appendix B.1). The second model uses the adaptive clipping scheme with α = 1.1. The test accuracy is reported for both models in Figure 3.4.

Figure 3.3: ℓ2 norms of the gradient per layer over the course of non-private training

Figure 3.4: Test accuracy for the constant and adaptive clipping methods


The proposed clipping scheme improves performance for this model on CIFAR-10. The test accuracy climbs from 61.6% to 63.4%. The difference in performance originates from the last third of training. This is in line with conclusions from McMahan et al. [18] who found that later in training, a non-optimal noise scale has a larger effect on the performance of the model. However, the gain in performance is relatively small and might be specific for this dataset.

To examine the stability of the α parameter, we trained models with adaptive clipping on MNIST, CIFAR-10 and CIFAR-100 to find the optimal value for α. The results are reported in Appendix B.2. We concluded that an α value of 1.0 is near optimal for all datasets. Therefore, the proposed adaptive clipping scheme reduces the number of tuneable privacy parameters without lowering the performance.

The proposed adaptive clipping scheme uses the mean ℓ2 norm of the previous batch to compute the ℓ2 norm bound of the next batch. In a distributed setting, computing this mean reveals extra private information for each user. To prevent this, methods from the work by Bonawitz et al. [5] can be used. In their work, a secure aggregation protocol is introduced to privately compute average models. The same protocol can also be used to jointly compute the mean of ℓ2 norms without revealing the individual values.

3.4 Large batch training

In this section, the relation between batch size and model performance is examined. Using large batches may be beneficial as the added noise per example is smaller for larger batches. To show the relation between batch size and noise per example, the moments accountant is used to find the minimum σ that fully utilizes a pre-defined privacy budget in 10 epochs for a hypothetical training set of 60,000 examples. No models are trained, but calls to the moments accountant are made as if a model is trained with these training parameters. The results are reported in Table 3.1. Training with larger batches dramatically reduces the added noise per example. However, very large batches show diminishing returns. This becomes more evident in Figure 3.5, where the σ per example is plotted against the batch size for a privacy budget of ε = 1.0.

Table 3.1: σ per example (×10³) for different batch sizes and privacy budgets

batch      128     256     512     1024    2048    4096    8192
ε = 1.0    14.2    9.77    6.84    4.79    3.27    2.32    1.65
ε = 2.0    8.47    5.42    3.63    2.48    1.71    1.20    0.84
ε = 3.0    6.84    4.14    2.65    1.75    1.20    0.83    0.57
ε = 4.0    6.06    3.55    2.20    1.41    0.94    0.64    0.45
ε = 5.0    5.59    3.22    1.93    1.21    0.79    0.53    0.36

Figure 3.5: σ per example vs batch size for ε = 1.0
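The search for the minimum σ can be sketched as a simple binary search around an accountant; the compute_epsilon callback below is an assumption standing in for whatever moments accountant implementation is available, it is not a library API.

import numpy as np

def min_sigma_for_budget(compute_epsilon, target_eps, n, batch_size, epochs,
                         delta, lo=0.3, hi=20.0, tol=1e-3):
    # Binary search for the smallest noise scale sigma whose accumulated privacy
    # loss after `epochs` epochs stays within the budget `target_eps`.
    # `compute_epsilon(sigma, q, steps, delta)` is assumed to wrap an accountant
    # implementation (e.g. the moments accountant of Abadi et al. [1]).
    q = batch_size / n
    steps = int(epochs * n / batch_size)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if compute_epsilon(mid, q, steps, delta) > target_eps:
            lo = mid          # too little noise: epsilon exceeds the budget
        else:
            hi = mid          # budget met: try a smaller sigma
    return hi

# Example call for the setting of Table 3.1 (N = 60,000, 10 epochs, delta = N^-1.1),
# given some accountant function `compute_epsilon`:
# sigma = min_sigma_for_budget(compute_epsilon, target_eps=1.0, n=60000,
#                              batch_size=1024, epochs=10, delta=60000 ** -1.1)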


Earlier work demonstrated that simply increasing the batch size does not necessarily result in increased performance for differential private models [1]. We confirm this result and conclude that the performance loss caused by naively using larger batches outweighs the performance gains obtained by the lower noise-per-example ratio.

Goyal et al. [13] introduced new guidelines for large-batch training without deterioration of performance. In the following experiments, one of the ideas from their work is applied to differential private training to examine whether the benefits of larger batches on the noise-per-example ratio can be exploited. Small Alexnet models with varying batch sizes are trained with the DPSGD algorithm on CIFAR-10 for 60 epochs with a fixed privacy budget of ε = 20. A base learning rate of 0.01 is used for a batch size of 128. When the batch size is scaled by a factor k, the learning rate is also scaled by a factor k, as suggested by Goyal et al. [13]. To find a σ that most tightly utilizes the privacy budget, we perform a grid search over σ with the moments accountant until an accumulated privacy loss of ε = 20 ± 0.05 after 60 epochs is found. In Figure 3.6 the test accuracy is plotted for multiple batch sizes over the course of training. In Table 3.2 the highest test accuracy is reported for these models.

Figure 3.6: Test accuracy over training for multiple batch sizes

Table 3.2: Batch size vs accuracy

Batch size        Test accuracy
128               61.6%
512               64.2%
1024              66.9%
1024 (base lr)    47.2%

It can be concluded that training differential private models with larger batches can be beneficial, but only when the learning rate is scaled accordingly. With this result, we now have guidelines for choosing important differential privacy parameters without fine-tuning: we can use adaptive clipping to set the ℓ2 norm bound, and choose a large batch size with a correspondingly scaled learning rate.

Conclusion

For differential private learning, hyperparameter optimization on real-world data is undesirable. The proposed methods enable an approach to differential private learning without spending privacy on hyperparameter tuning before training the final model. Given a classification task and data dimensions, we suggest the following guidelines for choosing the differential privacy parameters:

• Choose a model that is successfully tested on similar benchmark tasks and use default hyperparameters.

• Choose the largest batch size that fits on the training device(s) and scale the learning rate accordingly.

• Use the adaptive clipping method (3.1) with default value α = 1.0.

• Calibrate the noise scale parameter with the DPSGD or DP-FedAvg model on a centralized dataset of random noise until the proposed sanity checks succeed.

• When all sanity checks have passed, train the model on private data with the same settings.

The proposed sanity checks can be carried out centralized, before training and without using the actual data. Both sanity checks aim to create an environment where violating privacy is artificially made easy. However, the sanity checks are not sufficient conditions for private learning. Preliminary experiments showed that both using larger batches and using different optimizers reduced the speed of memorization for models without DP, while neither method offers any privacy guarantees. Several suggestions for potential future work regarding sanity checks are identified. First of all, the shortcomings of the proposed sanity checks could be examined. It is unclear whether the proposed sanity checks are sufficient when adversaries actively attempt to influence the training process, such as in the privacy attacks proposed by Hitaj et al. [14]. Additionally, the proposed sanity checks are not compatible with conditional models. Carlini et al. [6] proposed a setting where artificial private information was inserted in a benchmark dataset to test the memorization of a conditional next-character prediction model. Translating this setting to a sanity check with random noise and generic dataset dimensions could result in a good sanity check for conditional models.

The first differential privacy parameter we examined is the clipping bound. The proposed adaptive clipping scheme avoids tuning of the clipping bound parameter without reducing the performance of the model. While the suggested adaptive clipping scheme is successful in combination with a plain SGD optimizer, preliminary experiments revealed that naively applying the same method to the Momentum optimizer resulted in a small drop in performance. The dynamics of combining adaptive clipping with different optimizers are not well understood and are left open as a direction for future work.

Subsequently, the batch size parameter was examined. To optimize the accuracy of differential private training, large batch training with a linearly scaled learning rate can be used to reduce the amount of added noise per example in a batch. Naturally, linearly scaling the learning rate only works under the assumption that the initial learning rate is well chosen with respect to the task and model. However, there is no room for learning rate optimization on the actual task, so this is quite a strong assumption. We propose work by Smith et al. [26] as a possible direction for future work. They introduced a method that estimates the optimal learning rate. This solves the problem of choosing a learning rate while simultaneously drastically reducing the required number of iterations.

With the proposed methods in place, we can ask ourselves what extra requirements should be met to enable practical private machine learning. To ensure full privacy during machine learning, differential privacy methods should at least be combined with either HE or MPC methods. Still, even when these methods are successfully combined, privacy could be violated. Multiple users may share secrets, which can be learned by user-level differential private models. To protect against this, we could consider group-level privacy, by defining group-adjacent datasets. These can be formed by adding or removing all the data of a group of users. However, defining groups of connected users entails a whole collection of new challenges.

Privacy is a more relevant topic than ever, and the field of private machine learning is developing rapidly. Successfully applying private machine learning in a practical setting will require combined expertise from machine learning, cryptography and information theory. We might not be there yet, but we consider this a step in the right direction.


Bibliography

[1] M. Abadi, A. Chu, I. Goodfellow, H. Brendan McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. ArXiv e-prints, July 2016.

[2] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467, 2016.

[3] Martín Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. https://github.com/tensorflow/models/tree/master/research/differential_privacy/dp_sgd, 2016.

[4] Devansh Arpit, Stanislaw K. Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron C. Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 233–242. PMLR, 2017.

[5] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for federated learning on user-held data. CoRR, abs/1611.04482, 2016.

[6] Nicholas Carlini, Chang Liu, Jernej Kos, Úlfar Erlingsson, and Dawn Song. The secret sharer: Measuring unintended neural network memorization & extracting secrets. CoRR, abs/1802.08232, 2018.

[7] Nathan Dowlin, Ran Gilad-Bachrach, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy. Technical report, February 2016.

[8] Cynthia Dwork. Differential privacy. In 33rd International Colloquium on Automata, Languages and Programming, part II (ICALP 2006), volume 4052, pages 1–12, Venice, Italy, July 2006. Springer Verlag.


[9] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC’06, pages 265–284, Berlin, Heidelberg, 2006. Springer-Verlag.

[10] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3–4):211–407, August 2014.

[11] Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). Official Journal of the European Union, L119:1–88, May 2016.

[12] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22Nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15, pages 1322–1333, New York, NY, USA, 2015. ACM.

[13] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017.

[14] Briland Hitaj, Giuseppe Ateniese, and Fernando Pérez-Cruz. Deep models under the GAN: information leakage from collaborative deep learning. CoRR, abs/1702.07464, 2017.

[15] Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. CoRR, abs/1610.05492, 2016.

[16] Ninghui Li, Wahbeh Qardaji, and Dong Su. On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security, ASIACCS ’12, pages 32–33, New York, NY, USA, 2012. ACM.

[17] H. Brendan McMahan, Eider Moore, Daniel Ramage, and Blaise Agüera y Arcas. Federated learning of deep networks using model averaging. CoRR, abs/1602.05629, 2016.

[18] H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private language models without losing accuracy. CoRR, abs/1710.06963, 2017.

[19] Payman Mohassel and Yupeng Zhang. Secureml: A system for scalable privacy-preserving machine learning. In 2017 IEEE Symposium on Security and Privacy, SP 2017, San Jose, CA, USA, May 22-26, 2017, pages 19–38, 2017.

[20] Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. In Proceedings of the International Conference on Learning Representations, 2017.


[21] Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Úlfar Erlingsson. Scalable private learning with PATE. 2018.

[22] NhatHai Phan, Xintao Wu, Han Hu, and Dejing Dou. Adaptive Laplace mechanism: Differential privacy preservation in deep learning. CoRR, abs/1709.05750, 2017.

[23] R. Shokri, M. Stronati, C. Song, and V. Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18, May 2017.

[24] Reza Shokri and Vitaly Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22Nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15, pages 1310–1321, New York, NY, USA, 2015. ACM.

[25] Reza Shokri, Marco Stronati, and Vitaly Shmatikov. Membership inference attacks against machine learning models. CoRR, abs/1610.05820, 2016.

[26] Leslie N. Smith and Nicholay Topin. Super-convergence: Very fast training of residual networks using large learning rates. CoRR, abs/1708.07120, 2017.

[27] Yue Wang, Cheng Si, and Xintao Wu. Regression model fitting under differential privacy and model inversion attack. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pages 1003–1009. AAAI Press, 2015.

[28] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. CoRR, abs/1611.03530, 2016.


Appendix A

A.1 Comparison of the DPSGD algorithm and the DP-FedAvg algorithm

In this section, we compare the formulas for computing the noised gradients of the DPSGD algorithm and the DP-FedAvg algorithm. We begin by writing the noised gradient in the same notation and expand both formulas in such a way that they can be easily compared. We conclude that both algorithms are very similar, with a few subtle differences.

1. For the DPSGD algorithm, the noised gradient $\tilde{g}_t$ is defined as:

$$\tilde{g}_t = \frac{1}{L}\sum_i g_t(x_i) + \frac{1}{L}\mathcal{N}\!\left(0, \sigma^2 C^2 I\right) \qquad (A.1)$$

with group size $L$, individual updates $g_t$, noise scale $\sigma$ and gradient norm bound $C$. We can rewrite this as:

$$\tilde{g}_t = \frac{\sum_i g_t(x_i)}{L} + \mathcal{N}\!\left(0, \frac{\sigma^2 C^2}{L^2} I\right) \qquad (A.2)$$

2. For the DP-FedAvg algorithm we can define a similar noised gradient $\tilde{g}_t$ as:

$$\tilde{g}_t = \Delta^{t+1} + \mathcal{N}\!\left(0, \sigma_u^2 I\right) \qquad (A.3)$$

where we use $\sigma_u$ to denote the $\sigma$ parameter used in the DP-FedAvg algorithm, which is different from the $\sigma$ used in the DPSGD algorithm. We can expand $\Delta^{t+1}$ and $\sigma_u$ to get:

$$\tilde{g}_t = \frac{\sum_{k \in C^t} w_k \Delta^k}{qW} + \mathcal{N}\!\left(0, \frac{z^2 S^2}{q^2 W^2} I\right) \qquad (A.4)$$

with user update $\Delta^k$, user selection probability $q$, noise scale $z$, and gradient norm bound $S$. $W$ is defined as:

$$W = \sum_{k \in d} w_k \qquad (A.5)$$

where $d$ is the set of all users and $w_k = \min\!\left(\frac{n_k}{\hat{w}}, 1\right)$, where $n_k$ is the number of training examples for user $k$ and $\hat{w}$ is a parameter that defines the per-user example cap.

3. If we now assume $\forall j, k : n_j = n_k$ and set $\forall k : \hat{w} = n_k$, it follows that:

$$w_k = \min\!\left(\frac{n_k}{\hat{w}}, 1\right) = 1 \qquad (A.6)$$

and

$$\sum_{k \in d} w_k = W = N \qquad (A.7)$$

and

$$qW = qN = L \qquad (A.8)$$

where $N$ is the number of users in the dataset and $L$ is the group size of a random sample with sampling probability $q$.

Lastly, we use that the DPSGD clipping bound $C$ equals the DP-FedAvg clipping bound $S$, combined with (A.8), to rewrite (A.4):

$$\tilde{g}_t = \frac{\sum_{i \in L_t} \Delta^i}{L} + \mathcal{N}\!\left(0, \frac{z^2 C^2}{L^2} I\right) \qquad \text{(DP-FedAvg)} \quad (A.9)$$

compared with (A.2)

$$\tilde{g}_t = \frac{\sum_{i \in L_t} g_t(x_i)}{L} + \mathcal{N}\!\left(0, \frac{\sigma^2 C^2}{L^2} I\right) \qquad \text{(DPSGD)} \quad (A.10)$$

Now, three differences remain. Firstly, DP-FedAvg computes the user update $\Delta^i$, while the DPSGD algorithm computes the per-record update $g_t(x_i)$. Secondly, $\sigma$ has a different meaning in the two algorithms: for the DPSGD algorithm $\sigma$ is the noise scale, while for the user-level DP-FedAvg algorithm $\sigma_u$ is the standard deviation of the Gaussian, which follows from, among others, the noise scale $z$. Lastly, DPSGD applies the learning rate to the noised gradient, while the DP-FedAvg algorithm applies the learning rate before clipping, when computing the user updates. A small numerical sketch of the DPSGD noised-gradient computation is given below.
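To make the DPSGD side of the comparison concrete, the sketch below computes the noised gradient of equation (A.10) for a toy lot of per-example gradients, including the per-example clipping step of Abadi et al. [1]. The gradients here are random placeholders and the function is purely illustrative, not the implementation used in our experiments.

```python
import numpy as np

def dpsgd_noised_gradient(per_example_grads, C, sigma, seed=0):
    """Noised gradient of equation (A.10): clip each per-example gradient to
    L2 norm C, average over the lot of size L, and add Gaussian noise with
    standard deviation sigma * C / L per coordinate."""
    rng = np.random.default_rng(seed)
    L = len(per_example_grads)
    clipped = [g / max(1.0, np.linalg.norm(g) / C) for g in per_example_grads]
    mean_grad = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, sigma * C / L, size=mean_grad.shape)
    return mean_grad + noise

# toy usage with random stand-in gradients for a lot of size L = 4
grads = [np.random.randn(10) for _ in range(4)]
print(dpsgd_noised_gradient(grads, C=1.0, sigma=2.0))
```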


Appendix B

B.1 Grid searches for constant ℓ2 norm bounds

This section reports the results of the grid searches for the ℓ2 norm bounds in Experiments 3.1 and 3.3. The small Alexnet was trained with a privacy budget of ε = 20.

Figure B.1: Test accuracy for a grid search over the ℓ2 norm bound per layer (Momentum, experiment 3.1)

Figure B.2: Test accuracy for a grid search over the ℓ2 norm bound per layer (SGD, experiment 3.3)

Figure B.3: Test accuracy for a grid search over the concatenated ℓ2 norm bound (SGD, experiment 3.3)

B.2 Grid search over α for adaptive clipping

This section reports the results for the adaptive clipping parameter α from equation (3.1) on MNIST, CIFAR-10 and CIFAR-100. For all datasets, the small Alexnet is trained with a privacy budget of ε = 20.

Figures: Test accuracy for a grid search over α on MNIST, CIFAR-10 and CIFAR-100.


B.2.1 Concatenated clipping

We have also experimented with concatenated adaptive clipping and concluded that it degrades performance. With concatenated adaptive clipping we reached a maximum accuracy of about 60% on CIFAR-10 with a privacy budget of ε = 20.

Figure B.4: Test accuracy for a grid search over α for adaptive concatenated clipping of Alexnet on CIFAR-10

B.3 ℓ2 norm bounds for the user-level experiments

This section reports the ℓ2 norm bounds used in experiment 3.2.

Table B.1: Median ℓ2 norm bounds per layer over the course of training

Layer            ℓ2 norm bound
weights layer 1  0.366
bias layer 1     0.026
weights layer 2  0.235
bias layer 2     0.030
