August 16, 2022

Creating AI models in the age of data privacy

Caroline Zaborowski

For many years, big tech companies were able to take advantage of the naivety of their users. People would willingly share almost any personal data with them in exchange for “free” services. Even now, we see reports of embedded browsers being used to silently collect user data. The reason is simple: your personal data is hugely valuable to these companies. It allows them to build amazingly detailed profiles of your interests, political views, shopping habits, and so on. In turn, they can sell these profiles to advertisers or other organizations that want to target certain demographics.

So, how is this relevant to AI? Well, increasingly, users have become aware of the value of their personal data. Moreover, governments everywhere have started to legislate to protect this data. Most famously, the EU introduced the General Data Protection Regulation (GDPR) back in 2018. Recently, several US states have followed suit, including California, Utah, and Colorado. Since AI relies on data, any laws that make data harder to access will have a direct impact on how AI models can be built.

Now, don’t get me wrong. Data privacy is inherently a good thing. All our systems are designed to protect user data and keep it safe. This is essential since we work with financial and health companies among others. The real problem comes when legislation gets over-interpreted, blocking access to data even though the data subject might be perfectly happy to share it. 

Can you have too much privacy?

To understand what I mean, imagine the following scenario. You work for a pharma company that developed a widely used drug to treat a serious illness. Recent media reports are now linking this drug to serious problems. During the clinical trials, the drug was used to treat hundreds of patients, with only a small number of adverse side effects reported. However, your company has to address the negative publicity and find out what is going on. So, how do you go about testing whether these reports are true?

One way would be to manually contact as many doctors as possible and ask them to submit anonymous details of any patients who may have had adverse side effects. This data would be patchy at best and might require thousands of hours of manual processing before you could identify any patterns. Doctors may be too busy to share the data, or they may choose not to on patient confidentiality grounds.

A better way would be to get direct access to patient health records. You could then extract details of all patients prescribed the drug and check whether they suffered any side effects. This sort of anomaly detection is an ideal use case for AI. However, there is a huge problem. Medical records are among the most confidential personal data there is. Getting access to just one hospital’s records would be a legal and logistical nightmare and could take months. That would then have to be repeated for thousands or even tens of thousands of hospitals. 
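To make that concrete, here is a minimal sketch of what such an anomaly-detection step might look like, assuming the relevant records had already been pulled into a table. The column names and the choice of scikit-learn’s IsolationForest are purely illustrative assumptions, not a description of any real system.

```python
# Illustrative sketch only: column names, thresholds, and the fake data are hypothetical.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Pretend extract of patients who were prescribed the drug (randomly generated).
rng = np.random.default_rng(0)
records = pd.DataFrame({
    "age": rng.integers(18, 90, size=500),
    "dose_mg": rng.choice([10, 20, 40], size=500),
    "days_on_drug": rng.integers(1, 365, size=500),
    "liver_enzyme_level": rng.normal(30, 8, size=500),
})

# Flag patients whose profiles look unusual compared with the rest of the cohort.
model = IsolationForest(contamination=0.05, random_state=0)
records["flagged"] = model.fit_predict(records) == -1

print(records["flagged"].sum(), "patients flagged for manual review")
```

Of course, the hard part is not this model but getting lawful access to the records it would run on, which is exactly the problem discussed next.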

Some possible solutions

The above is a purely hypothetical situation and, in theory, is meant to be addressed by systems for reporting adverse drug reactions. However, it highlights an inherent conflict in data privacy legislation. Data subjects are probably thankful that their health data isn’t shared without permission. However, if they thought there was a problem with the drug they were taking, most would be completely OK with sharing their data. Now, you might think an obvious solution is to ask people to sign catch-all consents that allow their data to be shared in such circumstances. However, legally, such consents are typically not allowed. Moreover, data protection officers often take a very conservative view of data sharing even where consent has been given. Fortunately, there are some technical solutions to this kind of problem.

Anonymization and de-identification

The traditional approach to this problem is to process the data so that it is no longer linked to any individual data subject. Within the US, HIPAA (the Health Insurance Portability and Accountability Act) defines a process of de-identification after which data can be freely shared. One option is the “Safe Harbor” method, whereby a specific list of identifiers must be removed from the data in order to render it de-identified. Within the EU, the standard for anonymization is tougher: you must be able to prove that there is no way that any individual can be re-identified from the data. Unfortunately, it is rather easy to re-identify people using AI.
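As a rough illustration of the Safe Harbor idea, the sketch below strips a few hypothetical identifier columns and generalizes some quasi-identifiers. It is nowhere near a complete HIPAA de-identification process, which covers 18 categories of identifiers; the column names are assumptions made for the example.

```python
# Illustrative sketch only: a real Safe Harbor process covers 18 identifier
# categories and needs careful review; this just shows the general idea.
import pandas as pd

# Hypothetical direct identifiers to drop entirely.
DIRECT_IDENTIFIERS = ["name", "ssn", "phone", "email", "street_address", "mrn"]

def deidentify(df: pd.DataFrame) -> pd.DataFrame:
    out = df.drop(columns=[c for c in DIRECT_IDENTIFIERS if c in df.columns])
    # Safe Harbor also requires generalizing quasi-identifiers, for example
    # keeping only the year of dates, capping ages, and truncating ZIP codes.
    if "date_of_birth" in out.columns:
        out["birth_year"] = pd.to_datetime(out["date_of_birth"]).dt.year
        out = out.drop(columns=["date_of_birth"])
    if "age" in out.columns:
        out["age"] = out["age"].clip(upper=90)
    if "zip_code" in out.columns:
        out["zip3"] = out["zip_code"].astype(str).str[:3]
        out = out.drop(columns=["zip_code"])
    return out
```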

Synthetic data

A more recent approach is known as synthetic data. This is a direct alternative to anonymization. Essentially, you use AI to create fake data subjects that are statistically equivalent to the real ones. In other words, any features or correlations that exist in the real data will also exist in the synthetic data, but no record corresponds to any real person. This approach can help solve some of the problems above, but it comes with its own drawbacks. Firstly, it can be hard to create good-quality synthetic data. Secondly, if you are interested in a relatively rare feature in the data, the resulting synthetic data may not be all that anonymous. Finally, creating the synthetic data requires a copy of the real data in the first place, so it either has to happen locally or you need permission to transfer the data to another location.
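For intuition, here is a deliberately simple sketch of the idea: fit a distribution to the real (numeric) data and sample fake records from it. Real synthetic-data generators are far more sophisticated, handling categorical fields, non-linear relationships, and rare events; this toy version preserves only means and linear correlations.

```python
# Illustrative sketch only: fits a multivariate Gaussian to the numeric
# columns of the real data and samples "fake" records from it.
import numpy as np
import pandas as pd

def synthesize(real: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    numeric = real.select_dtypes("number")
    mean = numeric.mean().to_numpy()
    cov = numeric.cov().to_numpy()
    rng = np.random.default_rng(seed)
    fake = rng.multivariate_normal(mean, cov, size=n)
    return pd.DataFrame(fake, columns=numeric.columns)
```

Note that even this toy version illustrates the last drawback above: the function needs the real data as input, so it has to run wherever that data is allowed to live.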

Federated learning

The final potential solution is known as federated learning. This is one of the newest fields in machine learning and AI. In effect, rather than training a single model on all the data in one place, you train a model at each site where the data resides, for instance, in each hospital. These local models are then combined to create a global model that performs better than any individual one. This approach is already being used by some companies to overcome issues such as commercial sensitivity, and research projects like AICCELERATE are exploring whether it can be applied to health data as well.
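The sketch below shows the core idea with federated averaging (FedAvg) on a toy linear model: each site trains on its own data, and only the model weights leave the site to be averaged. Real federated systems add secure aggregation, communication protocols, and further privacy safeguards that are omitted here; the data and model are made up for illustration.

```python
# Illustrative sketch of federated averaging (FedAvg) on a toy linear model.
# Each "hospital" trains locally; only model weights are shared and averaged.
import numpy as np

def local_train(X, y, w, lr=0.01, epochs=50):
    """A few gradient-descent steps of linear regression on one site's data."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def fed_avg(sites, rounds=10, n_features=3):
    """Average locally trained weights, weighted by each site's sample count."""
    w_global = np.zeros(n_features)
    for _ in range(rounds):
        local_ws = [local_train(X, y, w_global.copy()) for X, y in sites]
        sizes = [len(y) for _, y in sites]
        w_global = np.average(local_ws, axis=0, weights=sizes)
    return w_global

# Toy data: three "hospitals" whose data follows the same underlying relationship.
rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0, 0.5])
sites = []
for n in (200, 80, 150):
    X = rng.normal(size=(n, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    sites.append((X, y))

print(fed_avg(sites))  # should land close to [1.5, -2.0, 0.5]
```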

Final words

Data privacy is definitely a challenge for some AI use cases. However, AI itself can offer viable solutions, such as synthetic data. But data privacy is a constant battle, with researchers applying ever more powerful AI techniques to undermine anonymization and other privacy-preserving measures. The upshot is that you are often better off solving the data privacy problem legally rather than trying to bypass it. And my final advice? Don’t overlook the importance of privacy when building AI solutions. But also, don’t assume your only solution to the privacy problem is technical.
