Privacy Techniques in Deep Learning

1. Introduction

With the advent of the internet, a huge amount of data has been generated, and it keeps growing exponentially. For example, each day about 500M tweets are sent, 95M photos and videos are shared on Instagram, and 500M photos are uploaded. This growth has enabled new techniques in Artificial Intelligence, such as Deep Learning, where models can use these large amounts of data to build the useful applications we use every day. In this dynamic interaction, where models and data combine into services, we need to address an important aspect: privacy. Privacy appears in many branches of science, and it is a property that must be maintained, although what it means can differ according to the context in which it is applied. In the particular case of deep learning, we are interested in preserving privacy at two levels:

  1. Data: we need to ensure that the data being used retains its privacy, and
  2. Models: we must protect the learned information, so that no leaking occurs.

In the following sections, we will apply different privacy techniques at both levels.

2. Differential Privacy (DP)

Suppose we have a large data set for which we have few or no labels. In a case like this, we could apply different techniques to label the data. Among them, we could use external models to classify our data, average their predictions (similar to a voting ensemble) and obtain our labels. Of course, we then need to train our own model. However, how can we guarantee that no information from the external data sets will be leaked into our labels? At first, it might seem that we cannot use the external models to obtain information about their data sets. In truth, it can be done. Therefore, we need a framework which:

  1. Allows us to label our data, and at the same time
  2. Maintains the privacy of the external data sets.

The DP framework is the answer to this problem. DP allows us to use the predictive capacity of external models in a way that maintains privacy. In practice, we add a new hyperparameter to our training that controls this privacy. In this scenario, a set of teachers (external models) is used to help generate labels: we take each teacher's prediction and combine them in a voting fashion, keeping the class with the most votes. However, we need to take the privacy factor into consideration, so we add an extra element to the process. This element is called epsilon, and it measures the level of information leakage as well as the privacy maintained. Thus, epsilon becomes another hyperparameter that we must tune, and it bounds how much information can leak into the generated labels. Since we achieve this by adding noise, there are several distributions we can draw it from; popular options are Gaussian and Laplace noise. Now, let's explore DP with an example using the digits data set. This data set contains handwritten digits from zero to nine, and each image is stored as an array.
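
As a concrete starting point, here is a minimal sketch of loading the data with scikit-learn's load_digits (an assumption on my part; any collection of 8x8 digit images flattened into 64-value arrays would work the same way):

import numpy as np
from sklearn.datasets import load_digits

# load the handwritten digits: 1797 images of 8x8 pixels, flattened to 64 features
digits = load_digits()
X, y = digits.data, digits.target
print(X.shape, y.shape)   # (1797, 64) (1797,)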

Now we need to simulate the teachers; remember that these teachers will help us generate our labels. They can be represented by any model. For simplicity, let's assume we already have ten teachers, which means we have already divided the data into eleven blocks: ten for the teachers and one for our local model. In this setting it is important to add randomness to the split. In real applications, ten teachers may be too few to see meaningful variation when applying PATE analysis (more on that later), but for now it is enough for our purposes. Using the teachers, we obtain a set of different predictions for our local data set, as seen below:

array([[1., 6., 6., ..., 6., 6., 6.],
       [1., 9., 9., ..., 9., 9., 9.],
       [1., 9., 9., ..., 9., 9., 9.],
       ...,
       [1., 6., 6., ..., 6., 6., 6.],
       [1., 8., 2., ..., 2., 8., 9.],
       [1., 3., 3., ..., 3., 3., 3.]])
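
For reference, here is a rough sketch of how such a prediction matrix could be produced. The teacher models below (plain scikit-learn logistic regressions) and the variable name X_local are my own assumptions; preds and y_true are the names the rest of this post relies on:

import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

num_teachers = 10
# shuffle, then split into eleven blocks: ten for the teachers, one for us
indices = np.random.permutation(len(X))
blocks = np.array_split(indices, num_teachers + 1)
teacher_blocks, local_block = blocks[:num_teachers], blocks[-1]

# keep the local block as torch tensors, since we will train a PyTorch model on it
X_local = torch.tensor(X[local_block], dtype = torch.float32)
y_true = torch.tensor(y[local_block])   # only used later, for evaluation

# train one simple classifier per teacher and predict on our local block
teacher_preds = []
for block in teacher_blocks:
    teacher = LogisticRegression(max_iter = 1000)
    teacher.fit(X[block], y[block])
    teacher_preds.append(teacher.predict(X_local.numpy()))

# one row per local example, one column per teacher
preds = np.stack(teacher_preds, axis = 1)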

Now, we need to add the privacy factor, epsilon, into the label generation. For this, we will use Laplace noise as follows:

import numpy as np

def laplacian_noise(preds, epsilon = 0.1):
    # For each example, count the teacher votes per class, perturb the counts
    # with Laplace noise of scale 1/epsilon, and keep the noisy argmax.
    new_labels = list()
    beta = 1 / epsilon
    for pred in preds:
        label_count = np.bincount(pred.astype(int), minlength = 10).astype(float)
        label_count += np.random.laplace(0, beta, len(label_count))
        new_labels.append(np.argmax(label_count))
    return new_labels

In this case, since we have ten classes, the minlength argument of bincount is set to 10. At this stage we already have our labels; the next step is training our local model. For that, let's use the following shallow network:

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64, 50)
        self.fc2 = nn.Linear(50, 50)
        self.output = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.log_softmax(self.output(x), dim = 1)
        return x
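
With the noisy labels in hand, training the local model is ordinary PyTorch. The exact loop is not shown in the original post, so here is a minimal sketch, assuming the X_local tensor and the preds matrix from the earlier sketch; the learning rate and number of epochs are arbitrary choices:

import torch
import torch.nn as nn
from torch import optim

# labels generated with the Laplace mechanism defined above
noisy_labels = torch.tensor(laplacian_noise(preds, epsilon = 0.1))

model = Net()
optimizer = optim.SGD(model.parameters(), lr = 0.01)
criterion = nn.NLLLoss()   # the network outputs log-probabilities

for epoch in range(50):
    optimizer.zero_grad()
    output = model(X_local)
    loss = criterion(output, noisy_labels)
    loss.backward()
    optimizer.step()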

Now, let's explore the effect of the epsilon hyperparameter in more detail. Since we are using it to preserve privacy, it is important to note that:

  • Larger values result in more privacy leakage but higher accuracy, and
  • Lower values result in less privacy leakage but lower accuracy.

To see this, let's repeat the process and generate labels using different epsilon values from 1e-5 to 10e+3, repeat the training, and finally compare the accuracy obtained with the generated labels against the real ones. As we can see in the figure, there is a trade-off between privacy and accuracy.
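
A simplified sketch of this sweep is below. Instead of retraining the network for every epsilon, it just measures how often the noisy labels agree with the real ones, which serves as a rough proxy for the full train-and-evaluate loop described above:

import torch

epsilons = [1e-5, 1e-3, 1e-1, 1, 10, 100, 1000, 10000]
for eps in epsilons:
    labels = torch.tensor(laplacian_noise(preds, epsilon = eps))
    agreement = (labels == y_true).float().mean().item()
    print(f"epsilon = {eps:g}, agreement with real labels = {agreement:.3f}")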

This is an important factor to keep in mind when performing Differential Privacy. In this case, we saw that larger epsilon values let the generated labels approximate the real ones, which means more privacy leakage, something we do not want. However, we must also tune epsilon carefully, since too much noise will hurt accuracy. In this example we could observe the behavior because we have the real labels; in a real case such information will not be available, so we need to balance this trade-off correctly.

Now, let's add another analysis. With the plot above we saw how epsilon behaves, but we would also like to see how the teachers influence the results: in other words, what is the impact if we change some of the teachers' labels? This is called PATE analysis. To perform this analysis, we will introduce PySyft. PySyft is a powerful Python library developed by OpenMined which allows us to apply several privacy techniques in the machine learning pipeline, among them Federated Learning, Differential Privacy and Multi-Party Computation. I highly recommend checking out their official website to learn more. To install it, just use pip install syft.

Now, back to our PATE analysis. To see how it works, let's first apply it to an example. Say we have generated our labels using 100 teachers. In order to run the analysis, we first need to import syft; then we measure the epsilon variation across the true and generated labels. As we can see in the code below, we need to pass the teacher predictions and the aggregated labels, alongside the noise epsilon.

# imports
import numpy as np
from syft.frameworks.torch.differential_privacy import pate
# pate analysis with 100 simulated teachers and random predictions
num_teachers, num_examples, num_labels = (100, 100, 10)
p = (np.random.rand(num_teachers, num_examples) * num_labels).astype(int)
indices = (np.random.rand(num_examples) * num_labels).astype(int)
pate.perform_analysis(teacher_preds = p, indices = indices, noise_eps = .1, delta = 1e-5)
# result: (data-dependent epsilon, data-independent epsilon)
(11.756462732485105, 11.756462732485115)

Now, let's see how changes in the teacher predictions impact the estimated epsilon values.

# changing some predictions; this will have an impact on epsilon
p[:, 0:5] *= 0
pate.perform_analysis(teacher_preds = p, indices = indices, noise_eps = .1, delta = 1e-5)
# result
(8.42373882359225, 11.756462732485115)

As we can see, changing the predictions lowers the data-dependent epsilon. Now, let's run the same analysis on our own predictions. As before, we first compute the actual epsilon values:

# import
from syft.frameworks.torch.differential_privacy import pate
# run pate analysis
pate.perform_analysis(teacher_preds = preds.T.astype(int), indices = y_true.numpy().astype(int), noise_eps = .1, delta = 1e-5)
# results
(15.776462732485108, 15.776462732485115)

Now, we change some of the predictions to see the variation:

# change the predictions of two teachers to class 0
preds[:, 5:7] = 0
# repeat the pate analysis
pate.perform_analysis(teacher_preds = preds.T.astype(int), indices = y_true.numpy().astype(int), noise_eps = .1, delta = 1e-5)
# results
(15.776462732485108, 15.776462732485115)

Surprisingly, the value did not change at all. This is because of the number of teachers we are using: since we only have ten, the impact is small. On the contrary, if we had more teachers, changing their predictions would have a significant impact on epsilon. This is an important aspect of Differential Privacy which we must keep in mind when using this framework. Next, let's explore how we can add privacy into the training process.

3. Federated Learning (FL)

As mentioned before, the internet continuously produces large amounts of data. For Deep Learning, we would like to use this data to obtain better models. However, this data is not necessarily concentrated in a single place; instead, it is distributed among users. This leaves us with two options:

  1. Send the model to train on the user data, or
  2. Collect more data.

Let's address the first option. If we send our model to the user, we must assume that the user has some sort of device to store it. Considering how big a deep learning model can be, this could become a very demanding task for the user. As for collecting more data, that data will not be fresh: take a text predictor, for example; what happens if we have to wait a week to update our model? And even if we are able to train our model remotely, what about user privacy? Again with the text predictor, what if the user's text data is somehow leaked?

To address these problems, we use the Federated Learning framework. With Federated Learning, we can train our model in a distributed fashion while maintaining user privacy. In this framework, a copy of the model is sent to each user, who in this context is called a worker. The remote copy then trains on the user's data. This is not a demanding task, since each user holds little data compared with a large central data set. After the remote models are trained, they return to the main server, and all users can then use the new, more accurate model. Crucially, the user data never left its remote location, so no private information has been sent to the server.

Now, let's see Federated Learning in action. For this example, we are going to use PySyft alongside the well-known iris data set. First, we need to link torch with PySyft, which can be done as follows:

# import
import torch
import syft as sy
# configure torch
hook = sy.TorchHook(torch)

This is important because it allows us to use the powerful tensor representations from torch within PySyft. Next, let's define our workers:

# define workers
worker_1 = sy.VirtualWorker(hook, id = 'worker_1')
worker_2 = sy.VirtualWorker(hook, id = 'worker_2')
worker_3 = sy.VirtualWorker(hook, id = 'worker_3')
worker_4 = sy.VirtualWorker(hook, id = 'worker_4')
worker_5 = sy.VirtualWorker(hook, id = 'worker_5')

We can define as many workers as we like. These workers can receive any kind of tensor data, so we can send them data sets as well as models. In this case they are virtual workers, but we can also define remote workers which operate across real devices. Now, let's send the data to the workers. This can be done using the send command:

x1 = X.send(worker_1)
y1 = y.send(worker_1)
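
The original snippet sends the tensors one worker at a time; a rough sketch of distributing a different shard to each of the four data workers (assuming X and y are the full iris features and labels as torch tensors, and that worker_5 is reserved as the secure worker used later) could look like this:

import torch

data_workers = [worker_1, worker_2, worker_3, worker_4]

# split the data set into one shard per worker
X_shards = torch.chunk(X, len(data_workers))
y_shards = torch.chunk(y, len(data_workers))

# send each shard to its worker; the server only keeps pointers
remote_data = [(Xs.send(w), ys.send(w))
               for Xs, ys, w in zip(X_shards, y_shards, data_workers)]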

Each worker receives a different part of the data set, as in the sketch above, and the data must be in a tensor or dataset representation (which is why we linked torch with PySyft). Now, let's define a shallow model to train:

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 50)
        self.fc2 = nn.Linear(50, 50)
        self.output = nn.Linear(50, 3)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.log_softmax(self.output(x), dim = 1)
        return x

Next, we need to train our model using Federated Learning. First, we instantiate the server model:

# learning rate
lr = 0.005
# instantiate the server model
server_model = Net()

Now, in the training loop, for each epoch we send a copy of the server model to each worker and create an optimizer for each remote copy.

for epoch in range(epochs):
  # send model to remote workers
  model_w1 = server_model.copy().send(worker_1)
  model_w2 = server_model.copy().send(worker_2)
  model_w3 = server_model.copy().send(worker_3)
  model_w4 = server_model.copy().send(worker_4)
  # create an optimizer for each remote model copy
  opt_w1 = optim.SGD(model_w1.parameters(), lr = lr)
  opt_w2 = optim.SGD(model_w2.parameters(), lr = lr)
  opt_w3 = optim.SGD(model_w3.parameters(), lr = lr)
  opt_w4 = optim.SGD(model_w4.parameters(), lr = lr)

Then we perform the training on the workers. Here we can specify a number of local iterations, iters; in this case we set it to 1. The training itself follows the normal PyTorch procedure, run once per worker with that worker's model copy, optimizer and data pointers:

for remote_epochs in range(iters):
    # train remote models as usual
    # set zero grads
    opt.zero_grad()
    # y_hat
    y_hat = model(X)
    # calculate loss
    loss = criterion(y_hat, y)
    # backward
    loss.backward()
    # update
    opt.step()

Now, using the move command, we move the trained remote models up to another worker, a secure worker. Ideally, this would be a remote machine different from the server. In this case, worker 5 acts as our secure worker:

  # move trained models to the secure worker
  model_w1.move(worker_5)
  model_w2.move(worker_5)
  model_w3.move(worker_5)
  model_w4.move(worker_5)

Then, in each epoch, we average the trained models on worker 5 and bring the averaged parameters back to the main server as follows:

  # average the remote models on worker 5 and retrieve the result
  server_model = update_model()
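
update_model() is not shown in the original post, so here is a rough, hypothetical sketch of what it might do, assuming PySyft 0.2.x pointer semantics: the parameters of the four model copies are summed and averaged remotely on worker 5, and only the averaged tensors are pulled back to the server.

def update_model():
    # Hypothetical helper: average the parameters of the model copies that
    # were moved to worker_5, then retrieve only the averaged values.
    remote_models = [model_w1, model_w2, model_w3, model_w4]
    new_model = Net()
    with torch.no_grad():
        for name, new_param in new_model.named_parameters():
            # sum the corresponding remote parameters on worker_5 ...
            total = None
            for m in remote_models:
                remote_param = dict(m.named_parameters())[name].data
                total = remote_param if total is None else total + remote_param
            # ... and only get() the averaged tensor back to the server
            new_param.copy_((total / len(remote_models)).get())
    return new_model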

This ensures that the server never accesses the individual workers' models or gradients directly; instead, it only receives their average. Finally, we repeat the training process as many times as necessary. Now, let's train our model for 300 epochs and see how good the results are:

As we can see, since each worker has a different part of the data set, in some cases, for example workers 2 and 3, the results are not good enough; their losses seem to decrease only linearly. On the contrary, workers 1 and 4 reach a small loss. This scenario can occur for different reasons; here, it is likely that workers 2 and 3 are missing some of the classes. This can also happen in a real application; however, we must remember that:

  1. Individual users will rarely have much data on their own,
  2. We are training across different devices, and
  3. We are preserving the users' privacy.

These are important factors to keep in mind when performing Federated Learning. Also, since we average the updates, our model becomes more consistent; with more workers we would obtain an even better model. This can be seen when we observe the loss of the averaged model: during training we obtained a robust model even though some workers reached low accuracy. Now comes the best part: after training, we can immediately put this model into production. In a real application it would most likely already be online, which means the users benefit from the new model while their privacy has been kept safe. However, there is a small detail we have not mentioned yet; in the next section we will add an extra security component to Federated Learning.

4. Encryption

Until now, we have successfully used Federated Learning to train a model across different workers. However, we have skipped a small detail: what if the remote users can access the raw gradients? If that happens, we are compromising not only the users' privacy but the model itself. Therefore, we need a way to protect the raw gradients. Why are they so important? Since we are dealing with neural networks, the gradients encode what the model has learned at each epoch, so they contain sensitive information.

Fortunately, we can use PySyft to add this extra security layer to Federated Learning. In this case, we encrypt the gradients with a technique called additive secret sharing. Basically, this ensures that both data and gradients can be shared safely among users. The encryption mechanism, in the case of neural networks, lets us encrypt the gradients; even more, we can perform operations over the data while it is encrypted. This means we can encrypt our model and share it across users safely. Let's explore this with an example, continuing with the iris data set. As we did before, first let's import PySyft:

import torch
import syft as sy
# set hook to use Syft natively in pytorch.
hook = sy.TorchHook(torch)

Now, let's create our workers. For this case, we will use two workers and one secure worker: the data and models go to the two workers, and the gradients will be averaged with the help of the secure one.

# define workers
worker_1 = sy.VirtualWorker(hook, id = "worker_1")
worker_2 = sy.VirtualWorker(hook, id = "worker_2")
secure_worker = sy.VirtualWorker(hook, id = "secure_worker")
# workers
workers = [worker_1, worker_2]

Next, let's create a data loader for the data set:

# batch size
batch_size = 8
train_data = torch.utils.data.TensorDataset(X, y)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)

Here we can choose any batch size; in this case 8 will do nicely. Now, let's reuse the shallow model from before:

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 50)
        self.fc2 = nn.Linear(50, 50)
        self.output = nn.Linear(50, 3)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.log_softmax(self.output(x), dim = 1)
        return x

Then, let's prepare the models and optimizers which will be sent to the workers:

lr = 1e-1
# models
w1_model = Net()
w2_model = Net()
# optimizer
w1_optimizer = optim.SGD(w1_model.parameters(), lr = lr)
w2_optimizer = optim.SGD(w2_model.parameters(), lr = lr)
# model handle
models = [w1_model, w2_model]
# optimizer handle
optimizers = [w1_optimizer, w2_optimizer]

Next, let's send the data to the remote workers:

# remote data set
remote_dataset = create_remote_dataset(workers, train_loader)
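
create_remote_dataset is a helper of this post rather than a PySyft function; a minimal sketch of what it might look like is shown below, distributing the batches round-robin across the workers and keeping only pointers on the server:

def create_remote_dataset(workers, loader):
    # one list of (data, target) pointer pairs per worker
    remote_dataset = [[] for _ in workers]
    for batch_idx, (data, target) in enumerate(loader):
        worker = workers[batch_idx % len(workers)]
        remote_dataset[batch_idx % len(workers)].append(
            (data.send(worker), target.send(worker)))
    return remote_dataset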

So far we have let PySyft handle the bookkeeping for us; this time, let's take a look at what the distributed data actually looks like:

remote_dataset
([((Wrapper)>[PointerTensor | me:50328223559 -> worker_1:14315906653],
   (Wrapper)>[PointerTensor | me:75844939098 -> worker_1:17230189219]),
  ((Wrapper)>[PointerTensor | me:37932056988 -> worker_1:33968197949],
   (Wrapper)>[PointerTensor | me:42151516823 -> worker_1:96335109282]),
  ((Wrapper)>[PointerTensor | me:39015392698 -> worker_1:60347744208],
   (Wrapper)>[PointerTensor | me:14294371660 -> worker_1:44638155149]),
  ((Wrapper)>[PointerTensor | me:60349429531 -> worker_1:11126049901],
   (Wrapper)>[PointerTensor | me:88729752983 -> worker_1:40004468284]),
  ((Wrapper)>[PointerTensor | me:44017618903 -> worker_1:96777639795],
   (Wrapper)>[PointerTensor | me:49187292051 -> worker_1:20930525870]),
  ((Wrapper)>[PointerTensor | me:55001260214 -> worker_1:74386286628],
   (Wrapper)>[PointerTensor | me:76027132477 -> worker_1:8060752502]),
  ((Wrapper)>[PointerTensor | me:32676374501 -> worker_1:53128119493],
   (Wrapper)>[PointerTensor | me:67968050477 -> worker_1:82322356041]),
  ((Wrapper)>[PointerTensor | me:29524407347 -> worker_1:73202936472],
   (Wrapper)>[PointerTensor | me:84519690655 -> worker_1:70463351834]),
  ((Wrapper)>[PointerTensor | me:53926123912 -> worker_1:17932188851],
   (Wrapper)>[PointerTensor | me:69584189141 -> worker_1:7477432851]),
  ((Wrapper)>[PointerTensor | me:18704048791 -> worker_1:18091029429],
   (Wrapper)>[PointerTensor | me:81559707911 -> worker_1:79991891382])],
 [((Wrapper)>[PointerTensor | me:29232508871 -> worker_2:62275106862],
   (Wrapper)>[PointerTensor | me:99855746894 -> worker_2:22092944459]),
  ((Wrapper)>[PointerTensor | me:98017073618 -> worker_2:97990554458],
   (Wrapper)>[PointerTensor | me:94021126281 -> worker_2:3176311706]),
  ((Wrapper)>[PointerTensor | me:77537265689 -> worker_2:15734656888],
   (Wrapper)>[PointerTensor | me:50877084875 -> worker_2:67598335135]),
  ((Wrapper)>[PointerTensor | me:46710869885 -> worker_2:94150979477],
   (Wrapper)>[PointerTensor | me:7967568335 -> worker_2:9140584728]),
  ((Wrapper)>[PointerTensor | me:15777521707 -> worker_2:94060857941],
   (Wrapper)>[PointerTensor | me:56686135325 -> worker_2:88762713364]),
  ((Wrapper)>[PointerTensor | me:40330887552 -> worker_2:90143235180],
   (Wrapper)>[PointerTensor | me:99218251783 -> worker_2:88498807554]),
  ((Wrapper)>[PointerTensor | me:19625908576 -> worker_2:55244737527],
   (Wrapper)>[PointerTensor | me:25945607336 -> worker_2:74181601936]),
  ((Wrapper)>[PointerTensor | me:18603123641 -> worker_2:82601611042],
   (Wrapper)>[PointerTensor | me:78340423902 -> worker_2:33661479555]),
  ((Wrapper)>[PointerTensor | me:5790577759 -> worker_2:2775920993],
   (Wrapper)>[PointerTensor | me:14040424544 -> worker_2:72173214308])])

As we can see, instead of tensors we now have a set of remote pointers. This is one of the most interesting features of PySyft: using this representation, we can perform operations on these pointers just as with ordinary tensors, without ever touching the real data, and the results of such operations are pointers as well. Of course, we can also retrieve the real values behind the pointers when needed. Now, let's add encryption to our model. For this, we call fix_precision() on the model, which encodes its parameters as fixed-precision integers so they can be secret-shared, and then split them into shares across the workers using:

model.copy().fix_precision().share(worker_1, worker_2, crypto_provider=secure_worker)
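
To make the fix_precision()/share() step less abstract, here is a tiny sketch on a plain tensor, assuming the PySyft 0.2.x API used throughout this post: the values are encoded as fixed-precision integers, split into additive shares held by the two workers, operated on while shared, and only decoded after get():

# encode as fixed precision, then split into additive shares across the workers
a = torch.tensor([1.0, 2.0, 3.0]).fix_precision().share(worker_1, worker_2, crypto_provider=secure_worker)
b = torch.tensor([4.0, 5.0, 6.0]).fix_precision().share(worker_1, worker_2, crypto_provider=secure_worker)

# the addition is performed on the shares; neither worker sees the real values
c = a + b

# reassemble the shares and decode back to floats: tensor([5., 7., 9.])
print(c.get().float_precision())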

Next, we average the gradients while they are still encrypted:

# gradient[i] holds worker i's (encrypted) gradients; add the shares,
# reconstruct with get(), decode with float_precision() and average
gradients = (gradient[0] + gradient[1]).get().float_precision()/2

In this case, note that since we are using two workers, we divide by the number of workers. Also, in order to obtain the decoded gradients (remember, they are encrypted), we first call get() to reconstruct the values from the shares, and then float_precision() to convert them back to ordinary floats. Finally, we update our model with the new gradients on the server side:

server_model = update_server_model(gradients)
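
Like update_model() earlier, update_server_model is a helper of this post rather than a library function. A rough sketch, assuming gradients holds the decoded, averaged gradient for each parameter of server_model (the snippet above shows the computation for one of them):

def update_server_model(avg_gradients, lr = 0.1):
    # Hypothetical helper: apply the decoded, averaged gradients to the
    # server-side model with a manual SGD step.
    with torch.no_grad():
        for param, grad in zip(server_model.parameters(), avg_gradients):
            param -= lr * grad
    return server_model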

Now, let's see how well Federated Learning works with Encryption:

As we can see, we are able to perform Encrypted Federated Learning and obtain good results. Note that this time we only used two workers, and since the data set is small, the data was better distributed among them.

5. Conclusions

In this article, we have discussed several privacy techniques which can be applied to any machine learning pipeline. In particular, we explored Differential Privacy, Federated Learning and Encryption, and we used PySyft alongside PyTorch to implement them. One interesting point to note is that in all the training processes, for both Federated Learning and Encrypted Federated Learning, we only used SGD as the optimizer. This is due to the nature of SGD, whose simple per-step updates make it easy to apply across different workers. Using another optimizer, like Adam, would give different (and mostly worse) results in this setting. Of course, this will likely change, and it is probable that we will be able to use other optimizers besides SGD in the future. Also, as Prof. Andrew Trask mentioned, privacy is a young area, so there are many useful approaches we can try.

It is also important to note that, as practitioners and researchers, it is our responsibility to maintain the privacy of the data. In the examples of this article we used small, standard data sets; in a real application there are many more considerations to keep in mind when applying any of these techniques, the most important being the trade-off between privacy and accuracy. The techniques must also be selected according to the situation.

Finally, I would like to express my gratitude towards Udacity, Facebook and Prof. Andrew Trask for developing the Secure and Private AI Challenge, from which I was able to write the content of this blog.
