Preserving privacy within a population: differential privacy
In this article, I will present the definition of differential privacy and explain how to preserve the privacy and personal data of users while their data is used to train machine learning models or to derive insights with data science techniques.
What is differential privacy?
Differential privacy describes a promise, made by a data holder, or curator, to a data subject:
''You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources are available.''
At their best, differentially private database mechanisms can make confidential data widely available for accurate data analysis, without resorting to data clean rooms, data usage agreements, data protection plans, or restricted views.
Nonetheless, data utility will eventually be consumed: the Fundamental Law of Information Recovery states that overly accurate answers to too many questions will destroy privacy in a spectacular way.
Differential privacy addresses the paradox of learning nothing about an individual while learning useful information about a population.
A medical database may teach us that smoking causes cancer, affecting an insurance company’s view of a smoker’s long-term medical costs.
Has the smoker been harmed by the analysis?
Perhaps — his insurance premiums may rise, if the insurer knows he smokes. He may also be helped — learning of his health risks, he enters a smoking cessation program.
Has the smoker’s privacy been compromised?
It is certainly the case that more is known about him after the study than was known before, but was his information “leaked”?
Differential privacy takes the view that it was not, with the rationale that the impact on the smoker is the same whether or not he was in the study. It is the conclusions reached in the study that affect the smoker, not his presence or absence in the data set.
Differential privacy ensures that the same conclusions, for example, smoking causes cancer, will be reached, independent of whether any individual opts into or opts out of the data set.
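One concrete way to realize this promise is the classic Laplace mechanism: answer a numeric query truthfully, then add noise calibrated to the query's sensitivity (how much one person's opt-in or opt-out can change the answer) divided by the privacy parameter epsilon. A minimal sketch using NumPy follows; the cohort size, the smoking count, and the epsilon value are made up purely for illustration:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value with Laplace noise of scale sensitivity/epsilon."""
    if rng is None:
        rng = np.random.default_rng()
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# Counting query: how many patients in the cohort smoke?
# One person joining or leaving changes the count by at most 1,
# so the sensitivity of this query is 1.
smokers = 132          # hypothetical true count
epsilon = 0.5          # smaller epsilon = stronger privacy, noisier answer
rng = np.random.default_rng(seed=0)
noisy_count = laplace_mechanism(smokers, sensitivity=1.0, epsilon=epsilon, rng=rng)
```

The released `noisy_count` is accurate enough to support the population-level conclusion (roughly how many smokers there are), while the noise masks the contribution of any single individual.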
Artificial Intelligence and the privacy paradox
Consider an institution, e.g. the National Institutes of Health, the Census Bureau, or a social networking company, in possession of a dataset containing sensitive information about individuals. For example, the dataset may consist of medical records, socioeconomic attributes, or geolocation data. The institution faces an important tradeoff when deciding how to make this dataset available for statistical analysis.
On one hand, if the institution releases the dataset (or at least statistical information about it), it can enable important research and eventually inform policy decisions.
On the other hand, for a number of ethical and legal reasons it is important to protect the individual-level privacy of the data subjects. The field of privacy-preserving data analysis aims to reconcile these two objectives. That is, it seeks to enable rich statistical analyses on sensitive datasets while protecting the privacy of the individuals who contributed to them.
Differential privacy and Machine Learning
One of the most useful tasks in data analysis is machine learning: the problem of automatically finding a simple rule to accurately predict certain unknown characteristics of never-before-seen data.
Many machine learning tasks can be performed under the constraint of differential privacy. In fact, privacy and learning are not necessarily at odds: both aim to extract information about the distribution from which the data was drawn, rather than about individual data points.
The goal in machine learning is very often similar to the goal in private data analysis. The learner typically wishes to learn some simple rule that explains a data set. However, she wishes this rule to generalize: the rule she learns should not only correctly describe the data she has on hand, but also correctly describe new data drawn from the same distribution.
Generally, this means that she wants to learn a rule that captures distributional information about the data set on hand, in a way that does not depend too specifically on any single data point.
Of course, this is exactly the goal of private data analysis: to reveal distributional information about the private data set, without revealing too much about any single individual in the dataset (remember the overfitting phenomenon?).
It should come as no surprise then that machine learning and private data analysis are closely linked. In fact, as we will see, we are often able to perform private machine learning nearly as accurately, with nearly the same number of examples as we can perform non-private machine learning.
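A common way to make learning private is the DP-SGD recipe: clip each per-example gradient to a bounded norm, average, and add Gaussian noise before the update, so no single example can dominate a step. Below is a minimal sketch on a toy logistic regression; the data, clipping norm, noise multiplier, and learning rate are all invented for illustration, and a real deployment would also track the cumulative privacy budget:

```python
import numpy as np

def dp_sgd_step(w, X, y, clip_norm, noise_mult, lr, rng):
    """One DP-SGD step: clip each per-example gradient, sum, add noise, average."""
    grads = []
    for xi, yi in zip(X, y):
        pred = 1.0 / (1.0 + np.exp(-xi @ w))              # logistic prediction
        g = (pred - yi) * xi                               # per-example gradient
        g = g / max(1.0, np.linalg.norm(g) / clip_norm)    # clip to bounded norm
        grads.append(g)
    noise = rng.normal(0.0, noise_mult * clip_norm, size=w.shape)
    noisy_grad = (np.sum(grads, axis=0) + noise) / len(X)
    return w - lr * noisy_grad

# Toy, synthetic data: labels determined by a hidden linear rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

w = np.zeros(3)
for _ in range(200):
    w = dp_sgd_step(w, X, y, clip_norm=1.0, noise_mult=1.1, lr=0.5, rng=rng)
```

Because the noise is scaled to what any one example could contribute, the learned `w` reflects the distribution rather than any individual row, which is exactly the "nearly as accurate" behavior described above.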
Cryptography and privacy
Some recent work has focused on machine learning or general computation over encrypted data.
Recently, Google deployed a new system for assembling a deep learning model from thousands of locally trained models while preserving privacy, which they call Federated Learning.
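The core aggregation step of this approach, known as federated averaging, combines model parameters trained locally on each client's private data, so raw data never leaves the device. A toy sketch of just that step follows; the client updates are simulated here (in a real system they would come from on-device training, typically with secure aggregation on top):

```python
import numpy as np

def federated_average(local_models):
    """Server-side step: average the parameter vectors sent by clients."""
    return np.mean(local_models, axis=0)

# Simulate clients whose local training yields noisy estimates
# of the same underlying model (purely illustrative).
rng = np.random.default_rng(42)
true_w = np.array([2.0, -1.0])
local_models = [true_w + rng.normal(0.0, 0.1, size=2) for _ in range(1000)]

global_w = federated_average(local_models)
```

The averaged `global_w` recovers the shared signal across clients while each individual update remains on the client, which is the privacy appeal of the federated setup.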
Differential privacy should not be seen as a limitation. Rather, we should look at it as a watchdog that checks our compliance with the standards for handling sensitive data. We generate more data than we think and leave a digital footprint everywhere; as researchers in machine learning and data science, we should pay more attention to this topic and find a fair trade-off between privacy and accurate models.