Maintaining the privacy of user data has become crucial. With the rise of machine learning, concerns about data misuse are being taken seriously, and people want to know how misuse of their data can be prevented. In response to this growing demand for data privacy, regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) now govern how companies collect, process, and store personal data.
AI Innovation, data, and the future
Amidst these data privacy concerns, we cannot ignore the fact that machine learning models are hungry for data: a model that is not fed data cannot be trained. However, AI researchers are developing techniques that improve privacy and help anonymize the data used in AI. Two approaches in particular can drastically reduce the risk of data misuse:
1. Differential privacy
According to Cynthia Dwork, differential privacy describes a promise made by a data holder, or curator, to a data subject. The promise reads:
“You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources are available.”
2. Federated learning
In 2017, Google introduced Federated Learning (FL) as “a specific category of distributed machine learning approaches which trains machine learning models using decentralized data residing on end devices such as mobile phones.”
Traditionally, machine learning algorithms are trained on datasets stored on on-premise or cloud-based servers. In federated learning, instead of training ML models on a central server, local ML models are trained on mobile devices using each user’s local data. The improved model weights are sent to the cloud, but the user’s data never leaves the device.
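To make the differential privacy promise quoted above a little more concrete, here is a minimal sketch of one common way to approach it, the Laplace mechanism, applied to a simple counting query. The function names, the made-up `ages` list, and the choice of `epsilon` are our own assumptions for illustration; this is not a production-grade implementation.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws, each with
    mean `scale`, follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Return a noisy count of the items that satisfy `predicate`.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)
ages = [23, 35, 41, 29, 52, 61, 38]  # made-up data, for illustration only
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
print(f"noisy count of people aged 40+: {noisy:.2f}")
```

A smaller `epsilon` adds more noise and gives stronger privacy. The answer stays useful on average, while any single person's presence in the data has only a limited effect on it.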
Why Federated Learning?
Federated learning leverages users’ personal devices and does not share their personal information with the server. We can train a machine learning model without identifying any user and without sending their personal information to the server.
The advantages of the federated learning approach are as follows:
1. Data Privacy
The training data stays on the user’s remote device. We return only a pointer to where the model weights are stored, which makes the data more secure.
2. Data encryption
Instead of exposing the original value of the data, the approach provides an encoded form. For example, the word “apple” is replaced by its hashed value, such as “1f3870be274f6c49b3e31a0c6728957f”.
3. Network Latency
Federated learning reduces latency because client devices don’t share their data. Since training happens on the local device, the training data itself is not exposed to transmission loss, networking, or routing glitches.
4. Faster Learning
Distributed learning is faster than training a single model on one massive dataset, because training runs in parallel across a vast number of machines.
5. Personalization
We train the model with person-specific data.
6. Cost Saving
Since the data is stored on the user’s device, the cost of data storage is reduced.
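The hashing mentioned under “Data encryption” above can be sketched in a few lines. The 32-character hexadecimal value in that example has the shape of an MD5 digest, so this sketch assumes MD5; the helper name `hash_token` is ours.

```python
import hashlib


def hash_token(token: str) -> str:
    """Return the hexadecimal MD5 digest of a UTF-8 encoded token."""
    return hashlib.md5(token.encode("utf-8")).hexdigest()


digest = hash_token("apple")
print(digest)  # a 32-character hexadecimal string
```

Note that hashing is one-way and is not encryption in the strict sense. A plain hash of a common word can also be guessed by hashing a dictionary of candidates, so this should be read as an illustration of encoding rather than as a complete protection scheme.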
Are mobile phones powerful enough for local training?
The answer is YES!
Federated learning relies on the processing capabilities of edge devices to perform local training and inference. Today, most smartphones and newly launched IoT devices are equipped with GPUs or other computing hardware capable of running powerful AI models.
Types of federated learning
There are two types of federated learning approaches:
1. Single-party system
In a single-party system, a single device is responsible for governing the distributed data capture and flow system. Models are trained in a federated manner on data that has the same structure across all client devices.
2. Multi-party system
In multi-party federated learning, two or more devices form a union to train a machine learning model on their datasets. Keeping each party’s data private is a significant addition in this approach. Moreover, the individual datasets need not share exactly the same structure, but they should be similar in structure.
Federated Learning Architecture
The diagram above shows the basic architecture of Federated Learning. Hand-held devices that participate in the federated learning architecture download the model from the server. The model is serialized in a specific data structure such as JSON or a tuple. The process then continues as follows:
- Each device makes predictions based on user input and takes the difference between prediction and input as training data.
- Each device uses this training data to retrain the model locally.
- The weights from all devices (the union of devices) are distributed and aggregated.
- A pointer to the aggregated weights is sent back to the server.
- The server updates its weights and sends the update to the client devices.
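The round trip described in the steps above can be sketched with weighted federated averaging on a tiny linear model. This is a simplified simulation in plain Python, not the architecture's actual code: the device names, the made-up data, the learning rate, and the number of rounds are all assumptions chosen for illustration.

```python
def local_update(weights, data, lr=0.01, epochs=5):
    """Run a few epochs of gradient descent on one device's data.

    `weights` is (w, b) for the model y = w * x + b. Only the updated
    weights are returned; `data` is never sent anywhere.
    """
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            error = (w * x + b) - y
            w -= lr * error * x
            b -= lr * error
    return (w, b)


def federated_average(updates):
    """Average client weights, weighting each client by its sample count."""
    total = sum(n for _, n in updates)
    w = sum(weights[0] * n for weights, n in updates) / total
    b = sum(weights[1] * n for weights, n in updates) / total
    return (w, b)


# Made-up local datasets following y = 2x + 1; each stays on its "device".
device_data = {
    "phone_a": [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)],
    "phone_b": [(3.0, 7.0), (4.0, 9.0)],
}

global_weights = (0.0, 0.0)  # the model the server starts with
for round_number in range(500):
    # Each device downloads the global weights and trains locally.
    updates = [
        (local_update(global_weights, data), len(data))
        for data in device_data.values()
    ]
    # The server aggregates the weights and publishes the new model.
    global_weights = federated_average(updates)

print(global_weights)  # should end up close to w = 2, b = 1
```

Only the `(w, b)` pairs and the sample counts cross the device boundary here. In a real deployment the aggregation step would typically also need secure transport and handling of devices that drop out mid-round.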
Step by Step Coding Process to Work with PySyft and PyTorch
Let’s learn how to build a simple POC using PySyft and PyTorch. PySyft is a Python library for secure, private machine learning. It can be hooked into deep learning frameworks such as PyTorch, TensorFlow, or Keras, and adds capabilities for remote execution, federated learning, differential privacy, homomorphic encryption, and multi-party computation.
1. Install the prerequisites and import the modules. We install PySyft in this example so that it can work with a remote device.
2. Let’s create remote devices (such as a phone or a laptop), also known as ‘workers’. We create two workers in this example, ‘bipin’ and ‘aviskar’. We then hook the workers to the deep learning models as shown in the example.
3. Because we are creating two workers inside our own machine, we should first convert the data into a federated learning format.
4. From the picture below, we can see a feature tensor called input_x that we have placed on a remote device called remote_loc. In an actual implementation, this remote location is the device voted for by the union of all hand-held devices to govern the overall process for the devices taking part. Because of this, it would be very difficult to retrieve the data through reverse engineering.
5. Here is an example that shows how a developer can implement the PyTorch model and train it across a number of hand-held devices.
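The worker and pointer ideas from steps 2 to 4 can be illustrated without any dependencies. Older PySyft releases (the 0.2.x line) exposed this idea through virtual workers and `.send()` / `.get()` calls on tensors; the `Worker` and `Pointer` classes and the `send` helper below are hypothetical stand-ins written in plain Python, not PySyft’s API, and plain lists stand in for tensors.

```python
class Worker:
    """Hypothetical stand-in for a remote device that stores objects by id."""

    def __init__(self, name):
        self.name = name
        self._store = {}
        self._next_id = 0

    def receive(self, obj):
        """Store an object locally and return the id it was stored under."""
        object_id = self._next_id
        self._next_id += 1
        self._store[object_id] = obj
        return object_id

    def run(self, object_id, func):
        """Apply `func` to a stored object and keep the result here too."""
        return self.receive(func(self._store[object_id]))

    def release(self, object_id):
        """Hand a stored object back to the caller and forget it."""
        return self._store.pop(object_id)


class Pointer:
    """Reference to data living on a worker; it holds no data itself."""

    def __init__(self, worker, object_id):
        self.worker = worker
        self.object_id = object_id

    def apply(self, func):
        """Run `func` on the worker; return a pointer to the remote result."""
        return Pointer(self.worker, self.worker.run(self.object_id, func))

    def get(self):
        """Explicitly pull the referenced object back from the worker."""
        return self.worker.release(self.object_id)


def send(obj, worker):
    """Place `obj` on `worker` and return a pointer to it."""
    return Pointer(worker, worker.receive(obj))


# Step 2: two simulated workers living inside our own machine.
bipin = Worker("bipin")
aviskar = Worker("aviskar")

# Step 3: split one dataset so that each worker holds its own shard.
input_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
shards = {
    "bipin": send(input_x[:3], bipin),
    "aviskar": send(input_x[3:], aviskar),
}

# Step 4: compute on the remote shard; only the small result is fetched.
remote_sum = shards["bipin"].apply(sum)
print(remote_sum.get())
```

The caller works only with `Pointer` objects, so the raw shard is never copied back unless `get()` is called on it explicitly. A training step would follow the same pattern: the function passed to `apply` would run the model update on the worker and return only the new weights.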
Challenges of Federated Learning
- Maintaining the distributed system (a large number of client workers).
- Keeping devices synchronized is another challenge, as users can turn their devices on or off at any time.
- Data imbalance can be a challenge, as we have no insight into the data we are training on.
- Data preprocessing is out of scope.
About the authors
Sushil Ghimire and Anish Pandey are Machine Learning Engineers at Leapfrog Technology. Sushil is proficient in natural language processing and is equally interested in deep learning algorithms and data structures. Anish is proficient in natural language processing, recommendation systems, and data analysis.