
Get Started with Federated Learning for Data Privacy


Maintaining the privacy of user data has become crucial. With the rise of machine learning, concerns about data misuse are being taken seriously: people want to know how the misuse of their data can be prevented, and the demand for data privacy keeps growing. Consequently, privacy regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) now govern how companies collect, process, and store personal data.

AI Innovation, data, and the future

Amid these privacy concerns, we cannot ignore the fact that machine learning models are hungry for data: without data to feed it, a model cannot be trained. However, AI researchers are designing new techniques that improve privacy and help keep data anonymous. Two approaches in particular can drastically decrease the risk:

1. Differential privacy

According to Cynthia Dwork, differential privacy describes a promise made by a data holder, or curator, to a data subject:
“You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources are available.”

2. Federated learning

In 2017, Google introduced Federated Learning (FL) as “a specific category of distributed machine learning approaches which trains machine learning models using decentralized data residing on end devices such as mobile phones.”

Traditionally, machine learning algorithms are trained on datasets stored on-premise or on cloud servers. In federated learning, instead of training one ML model on a central server, local models are trained on mobile devices using each user’s local data. Only the improved model weights are sent to the cloud; the user’s data never leaves the device.

Why Federated Learning?

Federated learning leverages users’ personal devices without sharing their personal information with the server. We can train a machine learning model without identifying any user and without their personal data ever being sent to the server.

The advantages of the federated learning approach are as follows:

1. Data Privacy

The training data stays on the remote device with the user; only a pointer to the updated model weights is returned. This keeps the data more secure.

2. Data encryption

Instead of transmitting the original value of the data, the system provides an encoded form. For example, the word “apple” can be hashed to a digest such as “1f3870be274f6c49b3e31a0c6728957f”.
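The digest above can be reproduced with Python’s standard library. MD5 is used here only because it matches the article’s example; real systems would use a stronger primitive such as SHA-256:

```python
import hashlib

def hash_value(value: str) -> str:
    """Return the MD5 digest of a string, matching the "apple" example above."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()

print(hash_value("apple"))  # 1f3870be274f6c49b3e31a0c6728957f
```

Note that hashing is one-way encoding, not reversible encryption: the server can compare digests but cannot recover the original value from one.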

3. Network Latency

Federated learning reduces latency because client devices do not ship their data to a server. Since training happens on the local device, there is no transmission loss and no exposure to networking or routing glitches during training.

4. Faster Learning

Distributed learning is faster than training a single model on massive centralized data, because training runs in parallel across a vast number of machines.

5. Personalized

The model is trained on each person’s own data, so it adapts to the individual user.

6. Cost Saving

Since the data stays on users’ devices, the cost of central data storage is reduced.

Are mobile phones powerful enough for local training?

The answer is YES!

Federated learning relies on the processing capabilities of edge devices to perform local training and inference. Today, most smartphones and newly launched IoT devices are equipped with GPUs or sufficient computing hardware to run powerful AI models.

Types of federated learning

There are two types of federated learning approaches:

1. Single-party system

In a single-party system, a single entity is responsible for the distributed data capture and for governing the flow. The models are trained in a federated manner on data that has the same structure across all the client devices.

2. Multi-party system

In multi-party federated learning, two or more parties form an alliance to train a machine learning model on their combined datasets while keeping each party’s individual data private, which is the significant addition in this approach. Moreover, the individual datasets are not required to have exactly the same structure, but they should be similar.

Federated Learning Architecture


The basic architecture of Federated Learning

The diagram above shows the basic architecture of federated learning. Hand-held devices that participate in the federated computation download the model from the server; the model is serialized in a specific data structure such as JSON or a tuple. Then, on each device and the server:

  • The device predicts based on user input and takes the difference from the actual outcome as training data.
  • It re-trains the model on the hand-held device using that training data.
  • Weights are distributed and aggregated across all devices (the union of devices).
  • A pointer to the aggregated weights is sent back to the server.
  • The server updates the global weights and sends the update to the client devices.
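This loop can be sketched as a minimal round of federated averaging in plain Python. The 1-D least-squares model, learning rate, and device data below are invented for illustration; real systems apply the same idea to full neural networks:

```python
# A toy round of federated averaging: each "device" trains locally on its own
# shard, then the server averages the resulting weights. The raw data in
# `devices` never leaves its list.

def local_update(weight, data, lr=0.1):
    """One gradient step of a 1-D least-squares model y = w*x on local data."""
    grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
    return weight - lr * grad

def federated_round(global_weight, devices):
    """Each device trains locally; the server averages the updated weights."""
    local_weights = [local_update(global_weight, data) for data in devices]
    return sum(local_weights) / len(local_weights)

# Two devices with private data drawn from y = 2x.
devices = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
w = 0.0
for _ in range(50):
    w = federated_round(w, devices)
print(round(w, 2))  # converges toward 2.0
```

The server only ever sees the averaged weights, never the per-device samples, which is the privacy property the architecture is built around.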

Step-by-Step Coding Process with PySyft and PyTorch

Before jumping into the code, a few prerequisites need to be installed on the local machine. Please follow the installation instructions for PyTorch and PySyft.

Let’s learn how to build a simple POC using PySyft and PyTorch. PySyft is a Python library for secure, private machine learning. It can be hooked into deep learning frameworks such as PyTorch, TensorFlow, or Keras, adding capabilities for remote execution, federated learning, differential privacy, homomorphic encryption, and multi-party computation.

1. Install the prerequisites and import the modules. In this example we install PySyft so that it can work with remote devices.

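The original install screenshot is unavailable. Assuming the PyPI package names torch and syft, the install step is along these lines (exact versions and extras may differ by platform):

```shell
# Install PyTorch and PySyft into the current Python environment
pip install torch syft
```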

2. Let’s create remote devices (such as a phone, a laptop, etc.), aka ‘workers’. In this example we create two workers, ‘bipin’ and ‘aviskar’, and hook them into the deep learning framework.

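The original code screenshot is unavailable. In PySyft this step uses sy.TorchHook and sy.VirtualWorker; as a dependency-free sketch of the same idea, here is a toy worker that holds data under named keys (the class and its methods are illustrative, not PySyft’s API):

```python
class ToyWorker:
    """A stand-in for a remote device: it stores data under named keys."""
    def __init__(self, id):
        self.id = id
        self._store = {}

    def receive(self, name, value):
        """Accept data sent to this worker."""
        self._store[name] = value

    def fetch(self, name):
        """Return data held by this worker."""
        return self._store[name]

# Two workers, named as in the original walkthrough.
bipin = ToyWorker("bipin")
aviskar = ToyWorker("aviskar")
print(bipin.id, aviskar.id)  # bipin aviskar
```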

3. Because we are simulating two workers inside our machine, we first convert the data into a federated format, splitting it across the workers.

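PySyft exposes this as a .federate((worker1, worker2)) helper on datasets. As a dependency-free sketch of the idea, samples can be split round-robin across the workers (the function and sample data are illustrative):

```python
def federate(dataset, workers):
    """Partition a dataset round-robin across workers; each shard stays local."""
    shards = {w: [] for w in workers}
    for i, sample in enumerate(dataset):
        shards[workers[i % len(workers)]].append(sample)
    return shards

# A tiny labeled dataset, split between the two workers from step 2.
data = [(0.0, 0), (0.1, 0), (0.9, 1), (1.0, 1)]
shards = federate(data, ["bipin", "aviskar"])
print(len(shards["bipin"]), len(shards["aviskar"]))  # 2 2
```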

4. Here we have a feature tensor called input_x, which we have sent to a remote device called remote_loc. In an actual deployment, this remote location is a device elected by the union of all participating hand-held devices to govern the overall process. Because the data lives only at that location, it would be very difficult to retrieve it through reverse engineering.

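In PySyft, sending a tensor to a worker returns a pointer whose .get() call retrieves the value. A toy version of that pointer mechanic (all names here are illustrative, not PySyft’s API):

```python
class Pointer:
    """A reference to data living on a remote worker, not the data itself."""
    def __init__(self, worker_store, key):
        self._store = worker_store
        self._key = key

    def get(self):
        # Retrieve the value and remove the remote copy.
        return self._store.pop(self._key)

remote_store = {}  # stands in for remote_loc's memory

def send(value, store, key):
    """Place a value on the remote store and hand back only a pointer."""
    store[key] = value
    return Pointer(store, key)

input_x = send([1.0, 2.0, 3.0], remote_store, "input_x")
print(type(input_x).__name__)  # Pointer
print(input_x.get())           # [1.0, 2.0, 3.0]
```

The caller only ever holds the pointer; the values themselves stay in the remote store until explicitly fetched.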

5. Finally, here is how a PyTorch model can be trained across a number of hand-held devices.

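The original screenshot is unavailable; as a stand-in, here is the shape of the federated training loop in plain Python. The model is “sent” to each worker in turn, takes a local gradient step on that worker’s shard, and comes back, echoing PySyft’s model.send(...)/model.get() pattern (the toy 1-D model and data are invented for illustration):

```python
def train_federated(weight, shards, epochs=30, lr=0.1):
    """Visit each worker in turn: send the model, step locally, bring it back."""
    for _ in range(epochs):
        for shard in shards.values():
            # model.send(worker): the step below runs on the worker's shard.
            grad = sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)
            weight -= lr * grad
            # model.get(): the updated weight returns; the shard never moves.
    return weight

# Each worker holds private data drawn from y = 3x.
shards = {"bipin": [(1.0, 3.0)], "aviskar": [(2.0, 6.0)]}
w = train_federated(0.0, shards)
print(round(w, 2))  # approaches 3.0
```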

Challenges

  • Maintaining the distributed system (a large number of client workers).
  • Maintaining synchronization, since users can turn their devices on and off at random.
  • Data imbalance, since we have no visibility into the data we are training on.
  • Data preprocessing on the device is out of the server’s control.

About the authors

Sushil Ghimire and Anish Pandey are Machine Learning Engineers at Leapfrog Technology. Sushil is proficient in natural language processing and has an equal interest in deep learning algorithms and data structures. Anish is proficient in natural language processing, recommendation systems, and data analysis.


