Tag Archives: differential privacy

Differentially private clustering for large-scale datasets

Clustering is a central problem in unsupervised machine learning (ML), with applications across many domains in both industry and academic research. At its core, clustering consists of the following problem: given a set of data elements, partition them into groups so that similar objects end up in the same group while dissimilar objects end up in different groups. This problem has been studied in mathematics, computer science, operations research, and statistics for more than 60 years in its myriad variants. Two common forms of clustering are metric clustering, in which the elements are points in a metric space, as in the k-means problem, and graph clustering, where the elements are nodes of a graph whose edges represent similarity among them.

In the k-means clustering problem, we are given a set of points in a metric space with the objective of identifying k representative points, called centers (here depicted as triangles), so as to minimize the sum of the squared distances from each point to its closest center. (Source: CC-BY-SA-4.0)
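
To make the objective concrete, here is a minimal, non-private Go sketch that evaluates the k-means cost of a candidate set of centers (the point and center values are illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// squaredDist returns the squared Euclidean distance between two points.
func squaredDist(a, b []float64) float64 {
	var s float64
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return s
}

// kMeansCost computes the k-means objective: the sum, over all points,
// of the squared distance from each point to its closest center.
func kMeansCost(points, centers [][]float64) float64 {
	var cost float64
	for _, p := range points {
		best := math.Inf(1)
		for _, c := range centers {
			if d := squaredDist(p, c); d < best {
				best = d
			}
		}
		cost += best
	}
	return cost
}

func main() {
	points := [][]float64{{0, 0}, {1, 0}, {10, 10}}
	centers := [][]float64{{0.5, 0}, {10, 10}}
	fmt.Println(kMeansCost(points, centers)) // 0.25 + 0.25 + 0 = 0.5
}
```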

Despite the extensive literature on algorithm design for clustering, few practical works have focused on rigorously protecting users' privacy during clustering. When clustering is applied to personal data (e.g., the queries a user has made), it is necessary to consider the privacy implications of using a clustering solution in a real system and how much information the output solution reveals about the input data.

To ensure privacy in a rigorous sense, one solution is to develop differentially private (DP) clustering algorithms. These algorithms ensure that the output of the clustering does not reveal private information about a specific data element (e.g., whether a user has made a given query) or sensitive data about the input graph (e.g., a relationship in a social network). Given the importance of privacy protections in unsupervised machine learning, in recent years Google has invested in research on the theory and practice of differentially private metric and graph clustering, and on differential privacy in a variety of other contexts, e.g., heatmaps or tools to design DP algorithms.

Today we are excited to announce two important updates: 1) a new differentially private algorithm for hierarchical graph clustering, which we’ll be presenting at ICML 2023, and 2) the open-source release of the code of a scalable differentially private k-means algorithm. This code brings differentially private k-means clustering to large-scale datasets using distributed computing. Here, we will also discuss our work on clustering technology for a recent launch in the health domain for informing public health authorities.


Differentially private hierarchical clustering

Hierarchical clustering is a popular clustering approach that consists of recursively partitioning a dataset into clusters at an increasingly finer granularity. A well-known example of hierarchical clustering is the phylogenetic tree in biology, in which all life on Earth is partitioned into finer and finer groups (e.g., kingdom, phylum, class, order, etc.). A hierarchical clustering algorithm receives as input a graph representing the similarity of entities and learns such recursive partitions in an unsupervised way. Yet at the time of our research no algorithm was known to compute a hierarchical clustering of a graph with edge privacy, i.e., while preserving the privacy of the vertex interactions.

In “Differentially-Private Hierarchical Clustering with Provable Approximation Guarantees”, we consider how well the problem can be approximated in a DP context and establish firm upper and lower bounds on the achievable approximation. We design an approximation algorithm (the first of its kind) with a polynomial running time that achieves both an additive error that scales with the number of nodes n (of order n^2.5) and a multiplicative approximation factor of O(log^1/2 n), with the multiplicative error identical to the non-private setting. We further provide a new lower bound on the additive error (of order n^2) for any private algorithm, irrespective of its running time, and provide an exponential-time algorithm that matches this lower bound. Moreover, our paper includes a beyond-worst-case analysis focusing on the hierarchical stochastic block model, a standard random graph model that exhibits a natural hierarchical clustering structure, and introduces a private algorithm whose additive cost over the optimum becomes negligible as graphs grow larger, again matching the non-private state-of-the-art approaches. We believe this work expands the understanding of privacy-preserving algorithms on graph data and will enable new applications in such settings.
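
Writing OPT for the cost of the best hierarchical clustering, and omitting (as the text above does) the dependence on the privacy parameter ε, the guarantees can be summarized as:

cost(private solution) ≤ O(log^1/2 n) · OPT + O(n^2.5),

while any DP algorithm, regardless of its running time, must incur additive error of order at least n^2.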


Large-scale differentially private clustering

We now switch gears and discuss our work on metric space clustering. Most prior work in DP metric clustering has focused on improving the approximation guarantees of the algorithms on the k-means objective, leaving scalability questions out of the picture. Indeed, it is not clear how efficient non-private algorithms such as k-means++ or k-means|| can be made differentially private without drastically sacrificing either the approximation guarantees or the scalability. On the other hand, both scalability and privacy are of primary importance at Google. For this reason, we recently published multiple papers that address the problem of designing efficient differentially private clustering algorithms that can scale to massive datasets. Moreover, our goal is to offer scalability to large input datasets even when the target number of centers, k, is large.

We work in the massively parallel computation (MPC) model, which is a computation model representative of modern distributed computation architectures. The model consists of several machines, each holding only part of the input data, that work together with the goal of solving a global problem while minimizing the amount of communication between machines. We present a differentially private constant factor approximation algorithm for k-means that only requires a constant number of rounds of synchronization. Our algorithm builds upon our previous work on the problem (with code available here), which was the first differentially-private clustering algorithm with provable approximation guarantees that can work in the MPC model.

The DP constant-factor approximation algorithm drastically improves on the previous work using a two-phase approach. In an initial phase it computes a crude approximation to “seed” the second phase, which consists of a more sophisticated distributed algorithm. Equipped with the first-step approximation, the second phase relies on results from the coreset literature to subsample a relevant set of input points and find a good differentially private clustering solution for them. We then prove that this solution generalizes, with approximately the same guarantee, to the entire input.


Vaccination search insights via DP clustering

We then apply these advances in differentially private clustering to real-world applications. One example is the use of our differentially private clustering solution to publish COVID vaccine-related queries while providing strong privacy protections for users.

The goal of Vaccination Search Insights (VSI) is to help public health decision makers (health authorities, government agencies, and nonprofits) identify and respond to communities' information needs regarding COVID vaccines. To achieve this, the tool allows users to explore, at different geographic granularities (zip-code, county, and state level in the U.S.), the top themes users search for in COVID-related queries. In particular, the tool visualizes statistics on trending queries rising in interest in a given locale and time.

Screenshot of the output of the tool. On the left, the top searches related to COVID vaccines during the period Oct 10-16, 2022. On the right, the queries that rose in importance during the same period compared to the previous week.

To help identify the themes of the trending searches, the tool clusters the search queries based on their semantic similarity. This is done by applying a custom-designed k-means-based algorithm to search data that has been anonymized using the DP Gaussian mechanism to add noise and remove low-count queries (thus resulting in a differentially private clustering). The method ensures strong differential privacy guarantees for the protection of user data.
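
As a rough sketch of the anonymization step described above (illustrative parameters, not the production pipeline), the following Go snippet adds Gaussian noise to per-query counts and drops queries whose noisy count falls below a threshold; in a real DP deployment, sigma and the threshold would be calibrated to the target (ε, δ) guarantee:

```go
package main

import (
	"fmt"
	"math/rand"
)

// anonymizeCounts adds Gaussian noise (standard deviation sigma) to each
// query count and removes queries whose noisy count falls below threshold.
// In a real DP pipeline, sigma and threshold are calibrated to (eps, delta).
func anonymizeCounts(counts map[string]float64, sigma, threshold float64, rng *rand.Rand) map[string]float64 {
	out := make(map[string]float64)
	for query, c := range counts {
		noisy := c + rng.NormFloat64()*sigma
		if noisy >= threshold {
			out[query] = noisy
		}
	}
	return out
}

func main() {
	rng := rand.New(rand.NewSource(42))
	counts := map[string]float64{
		"vaccine near me":      120,
		"vaccine side effects": 95,
		"rare query":           2, // likely dropped by the threshold
	}
	fmt.Println(anonymizeCounts(counts, 10.0, 25.0, rng))
}
```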

This tool provided fine-grained data on COVID vaccine perception in the population at an unprecedented granularity, something that is especially relevant for understanding the needs of the marginalized communities disproportionately affected by COVID. This project highlights the impact of our investment in differential privacy research and unsupervised ML methods. We are looking at other important areas where we can apply these clustering techniques to help guide decision making around global health challenges, like search queries on climate change-related challenges such as air quality or extreme heat.


Acknowledgements

We thank our co-authors Silvio Lattanzi, Vahab Mirrokni, Andres Munoz Medina, Shyam Narayanan, David Saulpic, Chris Schwiegelshohn, Sergei Vassilvitskii, Peilin Zhong and our colleagues from the Health AI team that made the VSI launch possible Shailesh Bavadekar, Adam Boulanger, Tague Griffith, Mansi Kansal, Chaitanya Kamath, Akim Kumok, Yael Mayer, Tomer Shekel, Megan Shum, Charlotte Stanton, Mimi Sun, Swapnil Vispute, and Mark Young.

For more information on the Graph Mining team (part of our Algorithms and Optimization team), visit our pages.

Source: Google AI Blog


More ways we’re making every day safer with Google

Every day, we focus on making sure you’re in control of your data by building products that are secure by default and private by design. At this year’s I/O, we’re introducing new features and technologies to keep you safer with Google.


Putting you in control of your data 


Privacy is personal. That's why we make it easy for you to choose the settings that are right for you — whether that’s one place to manage settings in your Google Account, Auto-Delete options, or controls that appear in context when you’re using our products. We announced a number of new controls today: 


  • Quick delete in Search. We’re introducing a new “quick delete” option to delete the last 15 minutes of your Search history with a single tap from the Google Account menu.

  • A passcode-protected Locked Folder in Photos. Have you ever handed your phone to someone to show them a photo, but worried they might scroll to a personal or sensitive image, like a photo of your passport or a surprise gift? “Locked Folder” is a new feature in Google Photos: a passcode-protected space where selected photos can be saved separately. These photos won’t show up as you scroll through your grid or in shared albums. This feature is coming first to Google Pixel phones, and to more Android devices throughout the year.

  • Location History reminders in your Maps Timeline. Now, when you see places you've visited in your Timeline, we'll remind you that it's because you turned on Location History — which you can easily turn off right there in your Timeline. 

1. New “quick delete” option in Search.  2. The new Locked Folder in Photos. 3. Location History reminders in your Maps Timeline. 


We’re also introducing new, industry-leading transparency and permission features in Android 12. The new OS includes a Privacy Dashboard where you can see a timeline of when apps accessed your camera, microphone, or device location. We’ve also added indicators that show when your camera or microphone is in use, as well as easy toggles to disable access to both across your device. And you can now choose to share your approximate location with an app instead of a precise one.


Building products that are secure by default


As recent high-profile third-party security incidents show, your information isn’t private if it’s not secure. With AI-driven technologies that protect billions of users around the world, our products are secure by default: every day, we block 100 million phishing attempts and 15 billion spam messages in Gmail and encrypt 4 billion photos. And Safe Browsing on Chrome and most other browsers helps keep the rest of the Internet secure, automatically protecting more than 4 billion devices.


One of the biggest security risks is still the continued reliance on passwords: they’re often easy to crack, reused across multiple sites, or stolen in phishing attacks. That’s why we’ve been working towards a password-free future, focusing on safer ways to authenticate your identity and building multiple layers of protection into your Google Account, like automatic enrollment in 2-Step Verification.


But because passwords are still required for most online accounts, we’ve also continued to improve our Password Manager, built directly into Chrome, Android and now iOS, to help you create, remember, save and auto-fill passwords across the web. Today, we announced new enhancements to Password Manager:

  • A new tool that makes it easy to import passwords from other password managers.

  • Deeper integrations with Chrome and Android to seamlessly fill your passwords across sites and apps, regardless of whether you’re on desktop or on mobile.

  • Password Alerts that automatically warn you if we detect that one of your saved passwords has been compromised in a third-party breach.

  • A smart way to fix compromised passwords in Chrome with a simple tap. For supported sites and apps, whenever Password Manager finds a password that may have been compromised, you’ll see a "change password" button from Assistant. When you tap the button, the Assistant will not only navigate to the site, but also go through the entire process of changing your password. This feature is available on Android devices and will be rolling out to more sites and apps in the future.


1. A new way to fix compromised passwords in Chrome. 2. A new tool to import passwords from other password managers to Password Manager. 3. Password Alerts. 


Making our products private by design


We’ve pioneered new computing technologies like Federated Learning (invented by Google researchers in 2016) that make it possible to deliver helpful experiences while protecting individual data and privacy. We’ve also led on Differential Privacy, which powers some of our most helpful features and products, from our COVID-19 Community Mobility Reports to traffic predictions in Maps, without revealing individual user data. And this expertise guides our work on broader industry initiatives, like the open-source Privacy Sandbox.


Now, we’re continuing that work with Android's Private Compute Core, which keeps your information safe and private for a number of popular AI-driven features like Live Caption (which displays captions based on audio), Now Playing (which tells you the song that’s playing) and Smart Reply (which suggests short responses to messages and emails). For these features, the audio and language processing happens exclusively on your device. Like the rest of Android, Private Compute Core is open source — it’s fully inspectable and verifiable by the security community. 


We’ll continue our work to make every day safer with Google with new controls, advanced security, and privacy-preserving technologies.


Posted by Jen Fitzpatrick, SVP, Core


DEFCON Differential Privacy Training Launch

Differential privacy is a technique that enables organizations to learn from the majority of their data while simultaneously ensuring that those results do not allow any individual’s data to be distinguished or re-identified. A popular way of attaining differential privacy is to add noise to the data, which provides mathematical bounds on the amount of information that is leaked. Our open source offering aims to help developers implement differential privacy.
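
Formally, a mechanism M is ε-differentially private if, for any two datasets D and D′ differing in one person’s data and any set S of outputs, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]. As a toy illustration of the noise-addition approach (a from-scratch sketch, not our library’s API), here is the classic Laplace mechanism for a counting query in Go; the count and ε values are illustrative:

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// laplaceNoise draws a sample from the Laplace distribution with mean 0
// and scale b, using inverse transform sampling.
func laplaceNoise(b float64, rng *rand.Rand) float64 {
	u := rng.Float64() - 0.5 // uniform in [-0.5, 0.5)
	sign := 1.0
	if u < 0 {
		sign = -1.0
	}
	return -b * sign * math.Log(1-2*math.Abs(u))
}

// privateCount releases a count with epsilon-DP. A counting query has
// sensitivity 1 (one person changes the count by at most 1), so the
// Laplace noise scale is 1/epsilon.
func privateCount(trueCount int, epsilon float64, rng *rand.Rand) float64 {
	return float64(trueCount) + laplaceNoise(1.0/epsilon, rng)
}

func main() {
	rng := rand.New(rand.NewSource(7))
	fmt.Printf("noisy count: %.2f\n", privateCount(1000, 0.5, rng))
}
```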

In the summer of 2019, we publicly launched our Differential Privacy Library. Since then, we’ve expanded it from just C++ to also include Go and Java.

We’ve come to realize that effectively implementing differential privacy requires more than just the library. As we mentioned in a post earlier this summer, we want all developers to be able to work with differential privacy, and that calls for more than an open-sourced library: it calls for training that shares knowledge with all developers.

Our goal with this training is to provide a helpful head start for those considering implementing differential privacy. We also want it to be an experience that is understandable and impactful for anyone in the privacy and security field, whether they are a beginner or someone with background knowledge in privacy.

This new training contains several steps and covers many topics, such as:
  • The foundations of differential privacy
  • Explanations of why aggregation by itself may not protect against privacy risks
  • The mathematics behind noise addition
  • Tools that can be used in conjunction with differential privacy
  • Codelabs that users can take (in Go)
  • Additional resources to address any further questions

Step 1: Take our survey! It only takes five minutes!

This survey enables us to gain insight into what you expect from this training. We are curious about your objectives and goals, and whether you have any prior experience with differential privacy.

Step 2: Check out an introductory video on differential privacy!

We introduce topics like data aggregation, k-anonymity, differential privacy, noise, and others. The goal of this module is to introduce the foundations of differential privacy and explain why it is an important and useful privacy tool.

Step 3: Try out our codelabs

We have provided codelabs in Go to help you practice implementing the Differential Privacy library end to end.

Step 4: Learn more about differential privacy.

We want to offer additional resources to help answer any questions you may have. If you find other resources, please let us know and we will add them to the overall training.

Step 5: Provide us with some feedback

Please use this survey as a platform to share your experience with this pilot. Did the content meet your expectations? Did it make sense? What was missing? This is the time for you to share your point of view and any pain points you experienced (as well as any positive aspects you encountered).

We hope this training provides an impactful experience for everyone, from beginner coders to privacy specialists. The public differential privacy training will launch at the Stanford Biodesign “Building for Digital Health” Buildathon, Sept 11-13, 2020, led by Stanford and supported by Google Cloud and Apple Health engineers.

Please continue to reach out to us at [email protected] to share your experiences. The suggestions we receive will help us improve and will inform our thinking as we add new features and updates.

Acknowledgements: Miguel Guevara, Bryant Gipson, Royce Wilson, Kate Frankenberg, Katie Holzheimer, Lior Gottleib, Carmen Bush

By Aditi Joshi – Security and Privacy Engineering, Google Cloud

Expanding our Differential Privacy Library

All developers have a responsibility to treat data with care and respect. Differential privacy helps organizations derive insights from data while simultaneously ensuring that those results do not allow any individual's data to be distinguished or re-identified. This principled approach supports data computation and analysis across many of Google’s core products and features.

Last summer, Google open sourced our foundational differential privacy library so developers and organizations around the world can benefit from this technology. Today, we’re announcing the addition of Go and Java versions of our library, an end-to-end solution for differential privacy (Privacy on Beam), and new tools to help developers implement this technology effectively.

We’ve listened to feedback from our developer community and, as of today, developers can now perform differentially private analysis in Java and Go. We’re working to bring these two libraries to full feature parity with C++.

We want all developers to have access to differential privacy, regardless of their level of expertise. Our new Privacy on Beam framework captures years of Googler developer experience and efficiency improvements in a comprehensive and easy-to-use solution that handles computation end to end. Built on Apache Beam, Privacy on Beam reduces implementation mistakes and takes care of the steps that are essential to differential privacy, including noise addition, partition selection, and contribution bounding. If you’re new to Apache Beam or differential privacy, our codelab can get you started.
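
To give a flavor of what “end-to-end” means here, below is a minimal sketch in the style of the Privacy on Beam codelab: a DP count of visits per hour. The Visit type and the parameter values are illustrative, and exact package paths and option names may differ across library versions, so treat this as a sketch rather than a definitive usage guide.

```go
package dpexample

import (
	"reflect"

	"github.com/apache/beam/sdks/go/pkg/beam"
	"github.com/google/differential-privacy/privacy-on-beam/pbeam"
)

// Visit is a hypothetical record type: one row per visitor per hour.
type Visit struct {
	VisitorID string
	Hour      int
}

func init() {
	// Beam needs to know how to (de)serialize the struct.
	beam.RegisterType(reflect.TypeOf(Visit{}))
}

// privateVisitsPerHour returns a DP count of visits keyed by hour.
// It assumes visits is a PCollection<Visit>.
func privateVisitsPerHour(s beam.Scope, visits beam.PCollection) beam.PCollection {
	// Privacy budget shared by all aggregations on this collection.
	spec := pbeam.NewPrivacySpec( /* epsilon= */ 1.0, /* delta= */ 1e-5)
	// Declare the privacy unit: each row belongs to the visitor in VisitorID.
	pcol := pbeam.MakePrivateFromStruct(s, visits, spec, "VisitorID")
	// Key each visit by its hour.
	hours := pbeam.ParDo(s, func(v Visit) int { return v.Hour }, pcol)
	// Count with bounded contributions: each visitor is counted in at most
	// one hour, at most once; pbeam clamps and adds noise accordingly.
	return pbeam.Count(s, hours, pbeam.CountParams{
		MaxPartitionsContributed: 1,
		MaxValue:                 1,
	})
}
```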

Tracking privacy budgets is another challenge developers face when implementing differential privacy. So, we’re also releasing a new Privacy Loss Distribution tool for tracking privacy budgets. With this tool, developers can maintain an accurate estimate of the total cost to user privacy for collections of differentially private queries, and better evaluate the overall impact of their pipelines. Privacy Loss Distribution supports widely used mechanisms (such as Laplace, Gaussian, and randomized response) and can scale to hundreds of compositions.
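
For intuition, the simplest budget-accounting rule is basic composition: if mechanisms M_1, …, M_k are respectively ε_1-, …, ε_k-differentially private, then releasing all of their outputs on the same data is differentially private with budget

ε_total = ε_1 + ε_2 + … + ε_k.

Privacy Loss Distribution accounting instead tracks the full distribution of the privacy loss and composes those distributions, which is how it obtains bounds tighter than this worst-case sum over hundreds of compositions.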

We hope these new languages, tools, and features unlock differential privacy for even more developers. Continue to share your stories and suggestions with us at [email protected]—your feedback will help inform our future differential privacy launches and updates.

Acknowledgements

Software Engineers: Yurii Sushko, Daniel Simmons-Marengo, Christoph Dibak, Damien Desfontaines, Maria Telyatnikova
Research Scientists: Pasin Manurangsi, Ravi Kumar, Sergei Vassilvitskii, Alex Kulesza, Jenny Gillenwater, Kareem Amin


By: Miguel Guevara, Mirac Vuslat Basaran, Sasha Kulankhina, and Badih Ghazi – Google Privacy Team and Google Research

Enabling Developers and Organizations to Use Differential Privacy

Originally posted on the Google Developers Blog
By: Miguel Guevara, Product Manager, Privacy and Data Protection Office


Whether you're a city planner, a small business owner, or a software developer, gaining useful insights from data can help make services work better and answer important questions. But, without strong privacy protections, you risk losing the trust of your citizens, customers, and users.

Differentially-private data analysis is a principled approach that enables organizations to learn from the majority of their data while simultaneously ensuring that those results do not allow any individual's data to be distinguished or re-identified. This type of analysis can be implemented in a wide variety of ways and for many different purposes. For example, if you are a health researcher, you may want to compare the average amount of time patients remain admitted across various hospitals in order to determine if there are differences in care. Differential privacy is a high-assurance, analytic means of ensuring that use cases like this are addressed in a privacy-preserving manner.

Today, we’re rolling out the open-source version of the differential privacy library that helps power some of Google’s core products. To make the library easy for developers to use, we’re focusing on features that can be particularly difficult to execute from scratch, like automatically calculating bounds on user contributions. It is now freely available to any organization or developer that wants to use it.
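
To illustrate why bounding user contributions matters (a from-scratch sketch under simplified assumptions, not the library’s API): before adding noise to a sum, each user’s contribution is clamped to a known range so that the sensitivity of the sum, and hence the noise scale, is bounded.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// boundedSum clamps each user's total contribution to [lower, upper] and
// then adds Laplace noise scaled to the resulting sensitivity, yielding an
// epsilon-DP sum. The bounds here are supplied by the caller; the library
// described above can also compute such bounds automatically.
func boundedSum(perUser map[string]float64, lower, upper, epsilon float64, rng *rand.Rand) float64 {
	var sum float64
	for _, v := range perUser {
		sum += math.Min(math.Max(v, lower), upper)
	}
	// One user changes the clamped sum by at most max(|lower|, |upper|).
	sensitivity := math.Max(math.Abs(lower), math.Abs(upper))
	b := sensitivity / epsilon
	u := rng.Float64() - 0.5 // uniform in [-0.5, 0.5)
	sign := 1.0
	if u < 0 {
		sign = -1.0
	}
	return sum - b*sign*math.Log(1-2*math.Abs(u)) // Laplace(b) noise
}

func main() {
	rng := rand.New(rand.NewSource(1))
	perUser := map[string]float64{"alice": 3, "bob": 250, "carol": 7}
	// bob's outlier contribution is clamped to 10 before noise is added.
	fmt.Printf("noisy sum: %.1f\n", boundedSum(perUser, 0, 10, 1.0, rng))
}
```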

A deeper look at the technology

Our open source library was designed to meet the needs of developers. In addition to being freely accessible, we wanted it to be easy to deploy and useful. 

Here are some of the key features of the library:
  • Statistical functions: Most common data science operations are supported by this release. Developers can compute counts, sums, averages, medians, and percentiles using our library.
  • Rigorous testing: Getting differential privacy right is challenging. Besides an extensive test suite, we’ve included an extensible ‘Stochastic Differential Privacy Model Checker library’ to help prevent mistakes.
  • Ready to use: The real utility of an open-source release is in answering the question “Can I use this?” That’s why we’ve included a PostgreSQL extension along with common recipes to get you started. We’ve described the details of our approach in a technical paper that we’ve just released today.
  • Modular: We designed the library so that it can be extended to include other functionalities such as additional mechanisms, aggregation functions, or privacy budget management.

Investing in new privacy technologies

We have driven the research and development of practical, differentially-private techniques since we released RAPPOR to help improve Chrome in 2014, and continue to spearhead their real-world application. 

We’ve used differentially private methods to create helpful features in our products, like showing how busy a business is over the course of a day or how popular a particular restaurant’s dish is in Google Maps, and to improve Google Fi.


This year, we’ve announced several open-source privacy technologies (TensorFlow Privacy, TensorFlow Federated, Private Join and Compute), and today’s launch adds to this growing list. We're excited to make this library broadly available and hope developers will consider leveraging it as they build out their comprehensive data privacy strategies. From medicine, to government, to business, and beyond, it’s our hope that these open-source tools will help produce insights that benefit everyone.

Acknowledgements
Software Engineers: Alain Forget, Bryant Gipson, Celia Zhang, Damien Desfontaines, Daniel Simmons-Marengo, Ian Pudney, Jin Fu, Michael Daub, Priyanka Sehgal, Royce Wilson, William Lam
