Differentially Private Training
Imagine a TDC who wants to train a model for detecting payment fraud. They could do this by collecting labelled transaction data from multiple payment companies. The trained model might be quite useful, but it could also reveal a lot of information about the transactions, even if the TDC only has access to the trained model. Other kinds of models have been shown to be vulnerable in this way: credit card numbers have been extracted from language models, and recognizable faces have been reconstructed from image models.
The DEPA training framework supports model training using a robust approach based on differential privacy.
Differential Privacy
At its root, differential privacy is a mathematical way to protect individuals when their data is used in datasets. DP allows high-level trends and patterns within the dataset to be shared while withholding information about individuals. It bounds how much an adversary can learn about any individual in the protected dataset, even when the data is combined with other datasets. In other words, the outcome of an analysis is almost the same whether or not a given individual participates in data collection, so providing data exposes the participant to very little additional risk.
Differential privacy works by introducing a privacy loss or privacy budget parameter, often denoted epsilon (ε). ε controls how much noise or randomness is added when the dataset is processed: a smaller ε means more noise and stronger privacy, while a larger ε means less noise and weaker privacy. Because the added randomness is carefully calibrated, the results are still accurate enough to yield aggregate insights through data analysis while maintaining the privacy of individual participants.
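For reference, the standard textbook definition of ε-differential privacy (not specific to DEPA) is the following: a randomized mechanism M is ε-differentially private if, for any two datasets D and D′ that differ in a single record, and for every set of outputs S,

```latex
% Standard epsilon-differential privacy guarantee (Dwork et al.)
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
```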
Differentially Private Training
There are several ways of training machine learning models with differential privacy. By far the most common is Differentially Private Stochastic Gradient Descent (DP-SGD). DP-SGD prevents the model from memorizing or leaking sensitive information about the data by clipping per-example gradients and adding noise to them during the optimization process. The amount of noise is carefully calibrated to satisfy the mathematical definition of differential privacy, which guarantees that the model's output is almost independent of any single data point. DP-SGD applies to many types of models, including deep neural networks, has been used for tasks in natural language processing and computer vision, and can also be used to fine-tune pre-trained models while preserving the privacy of the task-specific data.
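To make the clipping and noise-addition steps concrete, here is a minimal sketch of one DP-SGD step in plain PyTorch. It is illustrative only: the function and parameter names are not from any library, privacy accounting (which maps the noise multiplier to an (ε, δ) guarantee) is omitted, and production training would normally use a vetted implementation such as Opacus.

```python
# Illustrative DP-SGD step: per-example gradient clipping + calibrated Gaussian noise.
import torch

def dp_sgd_step(model, loss_fn, xs, ys, lr=0.1, clip_norm=1.0, noise_multiplier=1.1):
    summed_grads = [torch.zeros_like(p) for p in model.parameters()]

    # 1. Compute each example's gradient separately and clip its L2 norm to clip_norm.
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip_norm / (total_norm + 1e-6)).clamp(max=1.0)
        for acc, g in zip(summed_grads, grads):
            acc.add_(g * scale)

    # 2. Add Gaussian noise scaled to the clipping norm, average over the batch, update.
    with torch.no_grad():
        for p, g in zip(model.parameters(), summed_grads):
            noise = torch.randn_like(g) * (noise_multiplier * clip_norm)
            p.add_(-(lr / len(xs)) * (g + noise))
```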
Here are some references that explain DP-SGD in more detail:
- Differential Privacy Series | DP-SGD Algorithm Explained: This blog post provides a gentle introduction to DP-SGD and its core concepts, such as clipping and noise addition.
- Deep Learning with Differential Privacy (DP-SGD Explained): This article gives a more formal definition of differential privacy and shows how DP-SGD can be implemented using PyTorch.
- An Efficient DP-SGD Mechanism for Large Scale NLP Models: This paper presents an efficient implementation of DP-SGD for fine-tuning large-scale NLP models based on LSTM and transformer architectures.
Differentially Private Training/Fine-tuning in DEPA
The DEPA training framework provides TDPs with mechanisms to ensure that TDCs use their datasets in a way that protects the privacy of data principals. Using these mechanisms, TDPs can meet their compliance requirements.
- TDPs create anonymized datasets by using standard de-identification approaches such as k-anonymity, masking and scrubbing as defined by the TSO. In addition, TDPs allocate a privacy budget for each dataset based on recommendations of the TSO. The privacy budget bounds the information a TDC can learn about de-identified records in the dataset.
- When TDPs and TDCs sign contracts, they allocate a certain fraction of the privacy budget for each training run. This is specified in the privacy constraint in the contract.
- When a CCR is created to train a model, it requests data encryption keys from the TDP. The TDP SHALL NOT release keys to the CCR if the privacy budget has been exhausted.
- In the CCR, models SHALL be trained using a differentially private training algorithm subject to the privacy constraints specified in the contracts. For example, if a model is trained using DP-SGD, the clipping norm and noise multiplier SHALL be chosen based on those privacy constraints (see the sketch after this list).
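As an illustration of how a CCR might translate a contractual privacy constraint into concrete DP-SGD parameters, here is a hedged sketch. The contract fields shown are hypothetical, and Opacus's accountant utility is used as just one possible way to convert an (ε, δ) budget into a noise multiplier:

```python
# Sketch: derive DP-SGD parameters inside the CCR from the contract's privacy constraint.
from opacus.accountants.utils import get_noise_multiplier

privacy_constraint = {          # hypothetical excerpt from a signed contract
    "epsilon_per_run": 2.0,     # fraction of the dataset's budget allotted to this run
    "delta": 1e-6,
    "clip_norm": 1.0,
}

noise_multiplier = get_noise_multiplier(
    target_epsilon=privacy_constraint["epsilon_per_run"],
    target_delta=privacy_constraint["delta"],
    sample_rate=256 / 100_000,  # batch size / dataset size for this training run
    epochs=10,
)

# noise_multiplier and clip_norm then parameterize the DP-SGD training loop,
# e.g. via Opacus's PrivacyEngine or a loop like the sketch shown earlier.
```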
Privacy Budget Management
In any differentially private mechanism, processing a dataset, e.g., to train a model, consumes some of the privacy budget. The dataset can no longer be used once the allocated budget is exhausted. Therefore, the privacy budget is a scarce resource that must be carefully managed. The DEPA training framework will provide TDPs and TDCs with guidance and tools to manage the privacy budget correctly and judiciously.
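A minimal sketch of how a TDP might track per-dataset budget consumption and gate key release on it is shown below. The class and field names are illustrative; a real deployment would need durable storage, concurrency control, and composition-aware accounting across runs:

```python
# Minimal sketch of TDP-side privacy budget bookkeeping (illustrative names).
from dataclasses import dataclass

@dataclass
class DatasetBudget:
    total_epsilon: float        # budget allocated to the dataset (per TSO guidance)
    spent_epsilon: float = 0.0

    def remaining(self) -> float:
        return self.total_epsilon - self.spent_epsilon

    def authorize_run(self, requested_epsilon: float) -> bool:
        """Authorize key release only if the run fits within the remaining budget."""
        if requested_epsilon > self.remaining():
            return False        # budget exhausted: keys MUST NOT be released
        self.spent_epsilon += requested_epsilon
        return True

budget = DatasetBudget(total_epsilon=10.0)
assert budget.authorize_run(2.0)            # first training run: allowed
print(f"remaining budget: {budget.remaining():.1f}")
```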
Global vs TDC-specific budget allocation
TDPs may assign a privacy budget to each TDC, under the assumption that TDCs can be consistently identified over time and that TDCs do not collude, e.g., by sharing their models with each other. Using DIDs and registration-time checks addresses the first concern. Deliberate or inadvertent TDC collusion (e.g., due to a merger) is harder to address.
We recommend that TDPs employ the following graded budget allocation policies to address this concern.
| Sensitivity Level | Data Privacy Sensitivity | Solution | Notes |
|---|---|---|---|
| Level 0 | Non-sensitive or aggregate data. A large amount of data is available for training, and more data keeps being generated continuously. | Only legal protection | |
| Level 1 | Highly sensitive data. More such data is generated at a rate that is not too slow. | Renewable global privacy budget | |
| Level 2 | Highly sensitive data that is critical for the larger social good. More such data is generated only slowly. | Restrict use of the trained model to the CCR, i.e., the trained model never leaves the CCR | |
The choice of budget allocation policy depends on the privacy sensitivity of the data. It will be a critical responsibility of SROs to agree on the right level for each domain and dataset, so as to best reflect the trade-off between the privacy of the data and the criticality of the use cases for which the data is used.
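Purely as an illustration (the enum and policy names below are hypothetical, not part of the DEPA specification), the graded policy could be encoded as a simple lookup that a TDP or SRO tool consults when onboarding a dataset:

```python
# Hypothetical encoding of the graded budget allocation policy table above.
from enum import Enum

class SensitivityLevel(Enum):
    LEVEL_0 = 0   # non-sensitive / aggregate data, continuously replenished
    LEVEL_1 = 1   # highly sensitive, replenished at a reasonable rate
    LEVEL_2 = 2   # highly sensitive, slow to replenish, critical for social good

POLICY_BY_LEVEL = {
    SensitivityLevel.LEVEL_0: "legal_protection_only",
    SensitivityLevel.LEVEL_1: "renewable_global_privacy_budget",
    SensitivityLevel.LEVEL_2: "model_restricted_to_ccr",
}

print(POLICY_BY_LEVEL[SensitivityLevel.LEVEL_2])   # -> model_restricted_to_ccr
```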