When Ivan Medvedev joined Google as a privacy engineering manager in 2013, the company had rogue data anxiety. Its user base and set of services had become so massive that it seemed inevitable that sensitive data could accidentally crop up in unexpected places, like customers filing support tickets with more personal information than necessary.
So Medvedev worked with colleagues on Google’s privacy team to develop an internal tool that could scan large amounts of data and automatically home in on identifying information or other sensitive data. Whether it was an old tax form accidentally captured in a photo or patient data embedded in the pixels of an ultrasound, the team designed the tool to find the unexpected.
That internal tool became a full cloud privacy service, called Data Loss Prevention, in 2017. It not only runs in numerous Google products, including all of G Suite, but also offers an application programming interface that lets administrators use it outside Google’s ecosystem. At the Google Cloud Next conference Wednesday in San Francisco, Google expanded DLP further, introducing a new user interface that makes the privacy tool easier to use without technical expertise.
“In order to really protect something you need to know where it is, what it is, and how it’s handled,” Medvedev says. “If you really know what you’re doing there’s all this flexibility in DLP, but you don’t have to be a privacy pro to get use out of this.”
DLP leans on Google’s extensive machine learning capabilities—image recognition and machine vision, natural language processing, and context analysis all come into play—to seek out overlooked or unexpected sensitive data and automatically redact it. And while the Data Loss Prevention API can be customized based on specific types of data an administrator wants to catch—like patient information in a medical setting, or credit card numbers in a business—DLP also needs to be comprehensive enough to catch things organizations don’t know they’re looking for.
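That customization happens through an inspection configuration listing the "infoTypes" an administrator wants DLP to detect. As a rough sketch, the request might look like the following, assuming the dict-based request shape of Google's Python client; the infoType names are real built-in detectors, but the project ID and content are illustrative:

```python
# Sketch of an inspect-content request narrowed to specific infoTypes.
# The project ID and sample text are made up for illustration.

def build_inspect_request(project_id, text):
    """Assemble an inspect_content-style request limited to chosen infoTypes."""
    inspect_config = {
        "info_types": [
            {"name": "US_SOCIAL_SECURITY_NUMBER"},  # patient/customer IDs
            {"name": "CREDIT_CARD_NUMBER"},         # payment data
        ],
        "min_likelihood": "LIKELY",   # tune sensitivity vs. false positives
        "include_quote": True,        # return the matched text for review
    }
    return {
        "parent": f"projects/{project_id}",
        "inspect_config": inspect_config,
        "item": {"value": text},
    }

request = build_inspect_request("example-project", "Card: 4111 1111 1111 1111")
# With the google-cloud-dlp client library, a dict like this would be passed
# to DlpServiceClient().inspect_content(request=request).
```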
“Maybe in a customer support chat the agent says, ‘Can you give me the last four digits of your Social Security number?’ but the customer is excited and trying to help and sends the whole thing,” says Scott Ellis, a Google Cloud product manager. “DLP could be set up to apply masking before the agent even sees the number and before the business stores it. Or maybe you don’t want the agent to see it, but you want to collect it. It can be customized for different cases.”
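The two policies Ellis describes can be sketched with a toy masking function. The regex below is a stand-in for DLP’s detection, not the service itself, but it shows the difference between redacting a Social Security number entirely and keeping only the last four digits:

```python
import re

# Toy stand-in for DLP's SSN detection: matches the common ddd-dd-dddd form.
SSN_PATTERN = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")

def mask_ssn(text, keep_last_four=False):
    """Replace any SSN in `text` with a masked form.

    keep_last_four=False -> full redaction before the agent ever sees it.
    keep_last_four=True  -> retain the last four digits for verification.
    """
    def _mask(match):
        if keep_last_four:
            return f"XXX-XX-{match.group(3)}"
        return "XXX-XX-XXXX"
    return SSN_PATTERN.sub(_mask, text)

chat = "Sure, it's 123-45-6789, hope that helps!"
print(mask_ssn(chat))                       # -> Sure, it's XXX-XX-XXXX, hope that helps!
print(mask_ssn(chat, keep_last_four=True))  # -> Sure, it's XXX-XX-6789, hope that helps!
```

In the real service, the choice between these behaviors is a configuration decision; the point is that the raw value can be transformed before a human or a database ever touches it.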
All data evaluated by DLP runs through the platform’s API, whether it’s gigabytes or terabytes of information. Google says that it never logs or stores any data, but DLP is too resource-intensive to run locally. And for Google Cloud Platform customers this is less of a consideration anyway, since they already store their data with the company.
Ellis says that DLP’s main goals are classification of sensitive data, particularly identifying data, and thorough masking and deidentification, so that data can still be used for things like research or analysis without creating a privacy risk to individuals. The platform also analyzes risk for large quantities of data, and flags potentially problematic aberrations.
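The risk-analysis idea can be illustrated with a small, hypothetical check: even after direct identifiers are masked, a record whose remaining attributes form a rare combination may still be re-identifiable. The sketch below (field names and threshold are illustrative, in the spirit of k-anonymity) flags records whose quasi-identifier combination occurs fewer than k times:

```python
from collections import Counter

def flag_risky_records(records, quasi_identifiers, k=2):
    """Return records whose quasi-identifier combination occurs fewer than k times."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [r for r, key in zip(records, keys) if counts[key] < k]

records = [
    {"zip": "94103", "birth_year": 1980},
    {"zip": "94103", "birth_year": 1980},
    {"zip": "10001", "birth_year": 1955},  # unique combination: a statistical outlier
]
risky = flag_risky_records(records, ["zip", "birth_year"])
print(risky)  # -> [{'zip': '10001', 'birth_year': 1955}]
```

A flagged record like the third one is exactly the kind of aberration an administrator would want surfaced before releasing the data set for research.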
Ambra Health, a patient data and records sharing firm, has been working with Google on DLP’s use in medical data applications, specifically large-scale research. The company says that it has needed to bring specialized expertise to customize DLP for its use cases, but that the foundation is there.
“If you can get this data, deidentify it, and bring it against other data sets that you have, you can make advancements more rapidly,” Ambra CEO Morris Panner says. “But you need to mask it to comply with the law and be respectful. We couldn’t do that without this kind of tooling that enables HIPAA compliance and strong privacy.”
Though not every company is facilitating massive medical studies, DLP can also be helpful for general ass-covering—with real potential benefits to users. Misconfigurations in cloud platforms that lead to unintentionally exposed data continue to represent a major societal privacy issue. But a company that has redacted its data with DLP will at least avoid leaking identifiable information if its cloud administrators make an error in setting up data access controls.
Perspective remains important; DLP isn’t a panacea for data privacy. “Automatic redaction is a good thing to have, but might not always be very versatile beyond the most common cases,” says Lukasz Olejnik, an independent security and privacy adviser and research associate at the Center for Technology and Global Affairs at Oxford University. “DLP gives some edge on that, though, and it’s surely an asset in compliance. But it should not be misunderstood as a comprehensive, privacy-proofed solution in itself.”
But DLP’s new user interface will at least make it easier for small businesses or other organizations without extensive IT resources to get some data de-identification benefits.
“It’s challenging, you’ll never find everything,” Ellis says. “But the ability to mask this data and then do risk analysis and say ‘what else did we not find that might be a statistical outlier?’ That’s really important.”