Tag Archives: differential privacy

Advances in private training for production on-device language models

Language models (LMs) trained to predict the next word given input text are the key technology for many applications [1, 2]. In Gboard, LMs are used to improve users’ typing experience by supporting features like next word prediction (NWP), Smart Compose, smart completion and suggestion, slide to type, and proofread. Deploying models on users’ devices rather than enterprise servers has advantages like lower latency and better privacy for model usage. While training on-device models directly from user data effectively improves the utility performance for applications such as NWP and smart text selection, protecting the privacy of user data for model training is important.

Gboard features powered by on-device language models.

In this blog we discuss how years of research advances now power the private training of Gboard LMs, since the proof-of-concept development of federated learning (FL) in 2017 and formal differential privacy (DP) guarantees in 2022. FL enables mobile phones to collaboratively learn a model while keeping all the training data on device, and DP provides a quantifiable measure of data anonymization. Formally, DP is often characterized by (ε, δ) with smaller values representing stronger guarantees. Machine learning (ML) models are considered to have reasonable DP guarantees for ε=10 and strong DP guarantees for ε=1 when δ is small.

As of today, all NWP neural network LMs in Gboard are trained with FL with formal DP guarantees, and all future launches of Gboard LMs trained on user data require DP. These 30+ Gboard on-device LMs are launched in 7+ languages and 15+ countries, and satisfy (ɛ, δ)-DP guarantees of small δ of 10-10 and ɛ between 0.994 and 13.69. To the best of our knowledge, this is the largest known deployment of user-level DP in production at Google or anywhere, and the first time a strong DP guarantee of ɛ < 1 is announced for models trained directly on user data.


Privacy principles and practices in Gboard

In “Private Federated Learning in Gboard”, we discussed how different privacy principles are currently reflected in production models, including:

  • Transparency and user control: We provide disclosure of what data is used, what purpose it is used for, how it is processed in various channels, and how Gboard users can easily configure the data usage in learning models.
  • Data minimization: FL immediately aggregates only focused updates that improve a specific model. Secure aggregation (SecAgg) is an encryption method to further guarantee that only aggregated results of the ephemeral updates can be accessed.
  • Data anonymization: DP is applied by the server to prevent models from memorizing the unique information in individual user’s training data.
  • Auditability and verifiability: We have made public the key algorithmic approaches and privacy accounting in open-sourced code (TFF aggregator, TFP DPQuery, DP accounting, and FL system).

A brief history

In recent years, FL has become the default method for training Gboard on-device LMs from user data. In 2020, a DP mechanism that clips and adds noise to model updates was used to prevent memorization for training the Spanish LM in Spain, which satisfies finite DP guarantees (Tier 3 described in “How to DP-fy ML“ guide). In 2022, with the help of the DP-Follow-The-Regularized-Leader (DP-FTRL) algorithm, the Spanish LM became the first production neural network trained directly on user data announced with a formal DP guarantee of (ε=8.9, δ=10-10)-DP (equivalent to the reported ρ=0.81 zero-Concentrated-Differential-Privacy), and therefore satisfies reasonable privacy guarantees (Tier 2).


Differential privacy by default in federated learning

In “Federated Learning of Gboard Language Models with Differential Privacy”, we announced that all the NWP neural network LMs in Gboard have DP guarantees, and all future launches of Gboard LMs trained on user data require DP guarantees. DP is enabled in FL by applying the following practices:

  • Pre-train the model with the multilingual C4 dataset.
  • Via simulation experiments on public datasets, find a large DP-noise-to-signal ratio that allows for high utility. Increasing the number of clients contributing to one round of model update improves privacy while keeping the noise ratio fixed for good utility, up to the point the DP target is met, or the maximum allowed by the system and the size of the population.
  • Configure the parameter to restrict the frequency each client can contribute (e.g., once every few days) based on computation budget and estimated population in the FL system.
  • Run DP-FTRL training with limits on the magnitude of per-device updates chosen either via adaptive clipping, or fixed based on experience.

SecAgg can be additionally applied by adopting the advances in improving computation and communication for scales and sensitivity.

Federated learning with differential privacy and (SecAgg).

Reporting DP guarantees

The DP guarantees of launched Gboard NWP LMs are visualized in the barplot below. The x-axis shows LMs labeled by language-locale and trained on corresponding populations; the y-axis shows the ε value when δ is fixed to a small value of 10-10 for (ε, δ)-DP (lower is better). The utility of these models are either significantly better than previous non-neural models in production, or comparable with previous LMs without DP, measured based on user-interactions metrics during A/B testing. For example, by applying the best practices, the DP guarantee of the Spanish model in Spain is improved from ε=8.9 to ε=5.37. SecAgg is additionally used for training the Spanish model in Spain and English model in the US. More details of the DP guarantees are reported in the appendix following the guidelines outlined in “How to DP-fy ML”.


Towards stronger DP guarantees

The ε~10 DP guarantees of many launched LMs are already considered reasonable for ML models in practice, while the journey of DP FL in Gboard continues for improving user typing experience while protecting data privacy. We are excited to announce that, for the first time, production LMs of Portuguese in Brazil and Spanish in Latin America are trained and launched with a DP guarantee of ε ≤ 1, which satisfies Tier 1 strong privacy guarantees. Specifically, the (ε=0.994, δ=10-10)-DP guarantee is achieved by running the advanced Matrix Factorization DP-FTRL (MF-DP-FTRL) algorithm, with 12,000+ devices participating in every training round of server model update larger than the common setting of 6500+ devices, and a carefully configured policy to restrict each client to at most participate twice in the total 2000 rounds of training in 14 days in the large Portuguese user population of Brazil. Using a similar setting, the es-US Spanish LM was trained in a large population combining multiple countries in Latin America to achieve (ε=0.994, δ=10-10)-DP. The ε ≤ 1 es-US model significantly improved the utility in many countries, and launched in Colombia, Ecuador, Guatemala, Mexico, and Venezuela. For the smaller population in Spain, the DP guarantee of es-ES LM is improved from ε=5.37 to ε=3.42 by only replacing DP-FTRL with MF-DP-FTRL without increasing the number of devices participating every round. More technical details are disclosed in the colab for privacy accounting.

DP guarantees for Gboard NWP LMs (the purple bar represents the first es-ES launch of ε=8.9; cyan bars represent privacy improvements for models trained with MF-DP-FTRL; tiers are from “How to DP-fy ML“ guide; en-US* and es-ES* are additionally trained with SecAgg).

Discussion and next steps

Our experience suggests that DP can be achieved in practice through system algorithm co-design on client participation, and that both privacy and utility can be strong when populations are large and a large number of devices' contributions are aggregated. Privacy-utility-computation trade-offs can be improved by using public data, the new MF-DP-FTRL algorithm, and tightening accounting. With these techniques, a strong DP guarantee of ε ≤ 1 is possible but still challenging. Active research on empirical privacy auditing [1, 2] suggests that DP models are potentially more private than the worst-case DP guarantees imply. While we keep pushing the frontier of algorithms, which dimension of privacy-utility-computation should be prioritized?

We are actively working on all privacy aspects of ML, including extending DP-FTRL to distributed DP and improving auditability and verifiability. Trusted Execution Environment opens the opportunity for substantially increasing the model size with verifiable privacy. The recent breakthrough in large LMs (LLMs) motivates us to rethink the usage of public information in private training and more future interactions between LLMs, on-device LMs, and Gboard production.


Acknowledgments

The authors would like to thank Peter Kairouz, Brendan McMahan, and Daniel Ramage for their early feedback on the blog post itself, Shaofeng Li and Tom Small for helping with the animated figures, and the teams at Google that helped with algorithm design, infrastructure implementation, and production maintenance. The collaborators below directly contribute to the presented results:

Research and algorithm development: Galen Andrew, Stanislav Chiknavaryan, Christopher A. Choquette-Choo, Arun Ganesh, Peter Kairouz, Ryan McKenna, H. Brendan McMahan, Jesse Rosenstock, Timon Van Overveldt, Keith Rush, Shuang Song, Thomas Steinke, Abhradeep Guha Thakurta, Om Thakkar, and Yuanbo Zhang.

Infrastructure, production and leadership support: Mingqing Chen, Stefan Dierauf, Billy Dou, Hubert Eichner, Zachary Garrett, Jeremy Gillula, Jianpeng Hou, Hui Li, Xu Liu, Wenzhi Mao, Brett McLarnon, Mengchen Pei, Daniel Ramage, Swaroop Ramaswamy, Haicheng Sun, Andreas Terzis, Yun Wang, Shanshan Wu, Yu Xiao, and Shumin Zhai.

Source: Google AI Blog


DP-Auditorium: A flexible library for auditing differential privacy

Differential privacy (DP) is a property of randomized mechanisms that limit the influence of any individual user’s information while processing and analyzing data. DP offers a robust solution to address growing concerns about data protection, enabling technologies across industries and government applications (e.g., the US census) without compromising individual user identities. As its adoption increases, it’s important to identify the potential risks of developing mechanisms with faulty implementations. Researchers have recently found errors in the mathematical proofs of private mechanisms, and their implementations. For example, researchers compared six sparse vector technique (SVT) variations and found that only two of the six actually met the asserted privacy guarantee. Even when mathematical proofs are correct, the code implementing the mechanism is vulnerable to human error.

However, practical and efficient DP auditing is challenging primarily due to the inherent randomness of the mechanisms and the probabilistic nature of the tested guarantees. In addition, a range of guarantee types exist, (e.g., pure DP, approximate DP, Rényi DP, and concentrated DP), and this diversity contributes to the complexity of formulating the auditing problem. Further, debugging mathematical proofs and code bases is an intractable task given the volume of proposed mechanisms. While ad hoc testing techniques exist under specific assumptions of mechanisms, few efforts have been made to develop an extensible tool for testing DP mechanisms.

To that end, in “DP-Auditorium: A Large Scale Library for Auditing Differential Privacy”, we introduce an open source library for auditing DP guarantees with only black-box access to a mechanism (i.e., without any knowledge of the mechanism’s internal properties). DP-Auditorium is implemented in Python and provides a flexible interface that allows contributions to continuously improve its testing capabilities. We also introduce new testing algorithms that perform divergence optimization over function spaces for Rényi DP, pure DP, and approximate DP. We demonstrate that DP-Auditorium can efficiently identify DP guarantee violations, and suggest which tests are most suitable for detecting particular bugs under various privacy guarantees.


DP guarantees

The output of a DP mechanism is a sample drawn from a probability distribution (M (D)) that satisfies a mathematical property ensuring the privacy of user data. A DP guarantee is thus tightly related to properties between pairs of probability distributions. A mechanism is differentially private if the probability distributions determined by M on dataset D and a neighboring dataset D’, which differ by only one record, are indistinguishable under a given divergence metric.

For example, the classical approximate DP definition states that a mechanism is approximately DP with parameters (ε, δ) if the hockey-stick divergence of order eε, between M(D) and M(D’), is at most δ. Pure DP is a special instance of approximate DP where δ = 0. Finally, a mechanism is considered Rényi DP with parameters (𝛼, ε) if the Rényi divergence of order 𝛼, is at most ε (where ε is a small positive value). In these three definitions, ε is not interchangeable but intuitively conveys the same concept; larger values of ε imply larger divergences between the two distributions or less privacy, since the two distributions are easier to distinguish.


DP-Auditorium

DP-Auditorium comprises two main components: property testers and dataset finders. Property testers take samples from a mechanism evaluated on specific datasets as input and aim to identify privacy guarantee violations in the provided datasets. Dataset finders suggest datasets where the privacy guarantee may fail. By combining both components, DP-Auditorium enables (1) automated testing of diverse mechanisms and privacy definitions and, (2) detection of bugs in privacy-preserving mechanisms. We implement various private and non-private mechanisms, including simple mechanisms that compute the mean of records and more complex mechanisms, such as different SVT and gradient descent mechanism variants.

Property testers determine if evidence exists to reject the hypothesis that a given divergence between two probability distributions, P and Q, is bounded by a prespecified budget determined by the DP guarantee being tested. They compute a lower bound from samples from P and Q, rejecting the property if the lower bound value exceeds the expected divergence. No guarantees are provided if the result is indeed bounded. To test for a range of privacy guarantees, DP-Auditorium introduces three novel testers: (1) HockeyStickPropertyTester, (2) RényiPropertyTester, and (3) MMDPropertyTester. Unlike other approaches, these testers don’t depend on explicit histogram approximations of the tested distributions. They rely on variational representations of the hockey-stick divergence, Rényi divergence, and maximum mean discrepancy (MMD) that enable the estimation of divergences through optimization over function spaces. As a baseline, we implement HistogramPropertyTester, a commonly used approximate DP tester. While our three testers follow a similar approach, for brevity, we focus on the HockeyStickPropertyTester in this post.

Given two neighboring datasets, D and D’, the HockeyStickPropertyTester finds a lower bound,^δ  for the hockey-stick divergence between M(D) and M(D’) that holds with high probability. Hockey-stick divergence enforces that the two distributions M(D) and M(D’) are close under an approximate DP guarantee. Therefore, if a privacy guarantee claims that the hockey-stick divergence is at most δ, and^δ  > δ, then with high probability the divergence is higher than what was promised on D and D’ and the mechanism cannot satisfy the given approximate DP guarantee. The lower bound^δ  is computed as an empirical and tractable counterpart of a variational formulation of the hockey-stick divergence (see the paper for more details). The accuracy of^δ  increases with the number of samples drawn from the mechanism, but decreases as the variational formulation is simplified. We balance these factors in order to ensure that^δ  is both accurate and easy to compute.

Dataset finders use black-box optimization to find datasets D and D’ that maximize^δ, a lower bound on the divergence value δ. Note that black-box optimization techniques are specifically designed for settings where deriving gradients for an objective function may be impractical or even impossible. These optimization techniques oscillate between exploration and exploitation phases to estimate the shape of the objective function and predict areas where the objective can have optimal values. In contrast, a full exploration algorithm, such as the grid search method, searches over the full space of neighboring datasets D and D’. DP-Auditorium implements different dataset finders through the open sourced black-box optimization library Vizier.

Running existing components on a new mechanism only requires defining the mechanism as a Python function that takes an array of data D and a desired number of samples n to be output by the mechanism computed on D. In addition, we provide flexible wrappers for testers and dataset finders that allow practitioners to implement their own testing and dataset search algorithms.


Key results

We assess the effectiveness of DP-Auditorium on five private and nine non-private mechanisms with diverse output spaces. For each property tester, we repeat the test ten times on fixed datasets using different values of ε, and report the number of times each tester identifies privacy bugs. While no tester consistently outperforms the others, we identify bugs that would be missed by previous techniques (HistogramPropertyTester). Note that the HistogramPropertyTester is not applicable to SVT mechanisms.

Number of times each property tester finds the privacy violation for the tested non-private mechanisms. NonDPLaplaceMean and NonDPGaussianMean mechanisms are faulty implementations of the Laplace and Gaussian mechanisms for computing the mean.

We also analyze the implementation of a DP gradient descent algorithm (DP-GD) in TensorFlow that computes gradients of the loss function on private data. To preserve privacy, DP-GD employs a clipping mechanism to bound the l2-norm of the gradients by a value G, followed by the addition of Gaussian noise. This implementation incorrectly assumes that the noise added has a scale of G, while in reality, the scale is sG, where s is a positive scalar. This discrepancy leads to an approximate DP guarantee that holds only for values of s greater than or equal to 1.

We evaluate the effectiveness of property testers in detecting this bug and show that HockeyStickPropertyTester and RényiPropertyTester exhibit superior performance in identifying privacy violations, outperforming MMDPropertyTester and HistogramPropertyTester. Notably, these testers detect the bug even for values of s as high as 0.6. It is worth highlighting that s = 0.5 corresponds to a common error in literature that involves missing a factor of two when accounting for the privacy budget ε. DP-Auditorium successfully captures this bug as shown below. For more details see section 5.6 here.

Estimated divergences and test thresholds for different values of s when testing DP-GD with the HistogramPropertyTester (left) and the HockeyStickPropertyTester (right).

Estimated divergences and test thresholds for different values of s when testing DP-GD with the RényiPropertyTester (left) and the MMDPropertyTester (right)

To test dataset finders, we compute the number of datasets explored before finding a privacy violation. On average, the majority of bugs are discovered in less than 10 calls to dataset finders. Randomized and exploration/exploitation methods are more efficient at finding datasets than grid search. For more details, see the paper.


Conclusion

DP is one of the most powerful frameworks for data protection. However, proper implementation of DP mechanisms can be challenging and prone to errors that cannot be easily detected using traditional unit testing methods. A unified testing framework can help auditors, regulators, and academics ensure that private mechanisms are indeed private.

DP-Auditorium is a new approach to testing DP via divergence optimization over function spaces. Our results show that this type of function-based estimation consistently outperforms previous black-box access testers. Finally, we demonstrate that these function-based estimators allow for a better discovery rate of privacy bugs compared to histogram estimation. By open sourcing DP-Auditorium, we aim to establish a standard for end-to-end testing of new differentially private algorithms.


Acknowledgements

The work described here was done jointly with Andrés Muñoz Medina, William Kong and Umar Syed. We thank Chris Dibak and Vadym Doroshenko for helpful engineering support and interface suggestions for our library.

Source: Google AI Blog


Sparsity-preserving differentially private training

Large embedding models have emerged as a fundamental tool for various applications in recommendation systems [1, 2] and natural language processing [3, 4, 5]. Such models enable the integration of non-numerical data into deep learning models by mapping categorical or string-valued input attributes with large vocabularies to fixed-length representation vectors using embedding layers. These models are widely deployed in personalized recommendation systems and achieve state-of-the-art performance in language tasks, such as language modeling, sentiment analysis, and question answering. In many such scenarios, privacy is an equally important feature when deploying those models. As a result, various techniques have been proposed to enable private data analysis. Among those, differential privacy (DP) is a widely adopted definition that limits exposure of individual user information while still allowing for the analysis of population-level patterns.

For training deep neural networks with DP guarantees, the most widely used algorithm is DP-SGD (DP stochastic gradient descent). One key component of DP-SGD is adding Gaussian noise to every coordinate of the gradient vectors during training. However, this creates scalability challenges when applied to large embedding models, because they rely on gradient sparsity for efficient training, but adding noise to all the coordinates destroys sparsity.

To mitigate this gradient sparsity problem, in “Sparsity-Preserving Differentially Private Training of Large Embedding Models” (to be presented at NeurIPS 2023), we propose a new algorithm called adaptive filtering-enabled sparse training (DP-AdaFEST). At a high level, the algorithm maintains the sparsity of the gradient by selecting only a subset of feature rows to which noise is added at each iteration. The key is to make such selections differentially private so that a three-way balance is achieved among the privacy cost, the training efficiency, and the model utility. Our empirical evaluation shows that DP-AdaFEST achieves a substantially sparser gradient, with a reduction in gradient size of over 105X compared to the dense gradient produced by standard DP-SGD, while maintaining comparable levels of accuracy. This gradient size reduction could translate into 20X wall-clock time improvement.


Overview

To better understand the challenges and our solutions to the gradient sparsity problem, let us start with an overview of how DP-SGD works during training. As illustrated by the figure below, DP-SGD operates by clipping the gradient contribution from each example in the current random subset of samples (called a mini-batch), and adding coordinate-wise Gaussian noise to the average gradient during each iteration of stochastic gradient descent (SGD). DP-SGD has demonstrated its effectiveness in protecting user privacy while maintaining model utility in a variety of applications [6, 7].

An illustration of how DP-SGD works. During each training step, a mini-batch of examples is sampled, and used to compute the per-example gradients. Those gradients are processed through clipping, aggregation and summation of Gaussian noise to produce the final privatized gradients.

The challenges of applying DP-SGD to large embedding models mainly come from 1) the non-numerical feature fields like user/product IDs and categories, and 2) words and tokens that are transformed into dense vectors through an embedding layer. Due to the vocabulary sizes of those features, the process requires large embedding tables with a substantial number of parameters. In contrast to the number of parameters, the gradient updates are usually extremely sparse because each mini-batch of examples only activates a tiny fraction of embedding rows (the figure below visualizes the ratio of zero-valued coordinates, i.e., the sparsity, of the gradients under various batch sizes). This sparsity is heavily leveraged for industrial applications that efficiently handle the training of large-scale embeddings. For example, Google Cloud TPUs, custom-designed AI accelerators that are optimized for training and inference of large AI models, have dedicated APIs to handle large embeddings with sparse updates. This leads to significantly improved training throughput compared to training on GPUs, which at this time did not have specialized optimization for sparse embedding lookups. On the other hand, DP-SGD completely destroys the gradient sparsity because it requires adding independent Gaussian noise to all the coordinates. This creates a road block for private training of large embedding models as the training efficiency would be significantly reduced compared to non-private training.

Embedding gradient sparsity (the fraction of zero-value gradient coordinates) in the Criteo pCTR model (see below). The figure reports the gradient sparsity, averaged over 50 update steps, of the top five categorical features (out of a total of 26) with the highest number of buckets, as well as the sparsity of all categorical features. The sprasity decreases with the batch size as more examples hit more rows in the embedding table, creating non-zero gradients. However, the sparsity is above 0.97 even for very large batch sizes. This pattern is consistently observed for all the five features.

Algorithm

Our algorithm is built by extending standard DP-SGD with an extra mechanism at each iteration to privately select the “hot features”, which are the features that are activated by multiple training examples in the current mini-batch. As illustrated below, the mechanism works in a few steps:

  1. Compute how many examples contributed to each feature bucket (we call each of the possible values of a categorical feature a “bucket”).
  2. Restrict the total contribution from each example by clipping their counts.
  3. Add Gaussian noise to the contribution count of each feature bucket.
  4. Select only the features to be included in the gradient update that have a count above a given threshold (a sparsity-controlling parameter), thus maintaining sparsity. This mechanism is differentially private, and the privacy cost can be easily computed by composing it with the standard DP-SGD iterations.
Illustration of the process of the algorithm on a synthetic categorical feature that has 20 buckets. We compute the number of examples contributing to each bucket, adjust the value based on per-example total contributions (including those to other features), add Gaussian noise, and retain only those buckets with a noisy contribution exceeding the threshold for (noisy) gradient update.

Theoretical motivation

We provide the theoretical motivation that underlies DP-AdaFEST by viewing it as optimization using stochastic gradient oracles. Standard analysis of stochastic gradient descent in a theoretical setting decomposes the test error of the model into “bias” and “variance” terms. The advantage of DP-AdaFEST can be viewed as reducing variance at the cost of slightly increasing the bias. This is because DP-AdaFEST adds noise to a smaller set of coordinates compared to DP-SGD, which adds noise to all the coordinates. On the other hand, DP-AdaFEST introduces some bias to the gradients since the gradient on the embedding features are dropped with some probability. We refer the interested reader to Section 3.4 of the paper for more details.


Experiments

We evaluate the effectiveness of our algorithm with large embedding model applications, on public datasets, including one ad prediction dataset (Criteo-Kaggle) and one language understanding dataset (SST-2). We use DP-SGD with exponential selection as a baseline comparison.

The effectiveness of DP-AdaFEST is evident in the figure below, where it achieves significantly higher gradient size reduction (i.e., gradient sparsity) than the baseline while maintaining the same level of utility (i.e., only minimal performance degradation).

Specifically, on the Criteo-Kaggle dataset, DP-AdaFEST reduces the gradient computation cost of regular DP-SGD by more than 5x105 times while maintaining a comparable AUC (which we define as a loss of less than 0.005). This reduction translates into a more efficient and cost-effective training process. In comparison, as shown by the green line below, the baseline method is not able to achieve reasonable cost reduction within such a small utility loss threshold.

In language tasks, there isn't as much potential for reducing the size of gradients, because the vocabulary used is often smaller and already quite compact (shown on the right below). However, the adoption of sparsity-preserving DP-SGD effectively obviates the dense gradient computation. Furthermore, in line with the bias-variance trade-off presented in the theoretical analysis, we note that DP-AdaFEST occasionally exhibits superior utility compared to DP-SGD when the reduction in gradient size is minimal. Conversely, when incorporating sparsity, the baseline algorithm faces challenges in maintaining utility.

A comparison of the best gradient size reduction (the ratio of the non-zero gradient value counts between regular DP-SGD and sparsity-preserving algorithms) achieved under ε =1.0 by DP-AdaFEST (our algorithm) and the baseline algorithm (DP-SGD with exponential selection) compared to DP-SGD at different thresholds for utility difference. A higher curve indicates a better utility/efficiency trade-off.

In practice, most ad prediction models are being continuously trained and evaluated. To simulate this online learning setup, we also evaluate with time-series data, which are notoriously challenging due to being non-stationary. Our evaluation uses the Criteo-1TB dataset, which comprises real-world user-click data collected over 24 days. Consistently, DP-AdaFEST reduces the gradient computation cost of regular DP-SGD by more than 104 times while maintaining a comparable AUC.

A comparison of the best gradient size reduction achieved under ε =1.0 by DP-AdaFEST (our algorithm) and DP-SGD with exponential selection (a previous algorithm) compared to DP-SGD at different thresholds for utility difference. A higher curve indicates a better utility/efficiency trade-off. DP-AdaFEST consistently outperforms the previous method.

Conclusion

We present a new algorithm, DP-AdaFEST, for preserving gradient sparsity in differentially private training — particularly in applications involving large embedding models, a fundamental tool for various applications in recommendation systems and natural language processing. Our algorithm achieves significant reductions in gradient size while maintaining accuracy on real-world benchmark datasets. Moreover, it offers flexible options for balancing utility and efficiency via sparsity-controlling parameters, while our proposals offer much better privacy-utility loss.


Acknowledgements

This work was a collaboration with Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi and Amer Sinha.

Source: Google AI Blog


Announcing the first Machine Unlearning Challenge

Deep learning has recently driven tremendous progress in a wide array of applications, ranging from realistic image generation and impressive retrieval systems to language models that can hold human-like conversations. While this progress is very exciting, the widespread use of deep neural network models requires caution: as guided by Google’s AI Principles, we seek to develop AI technologies responsibly by understanding and mitigating potential risks, such as the propagation and amplification of unfair biases and protecting user privacy.

Fully erasing the influence of the data requested to be deleted is challenging since, aside from simply deleting it from databases where it’s stored, it also requires erasing the influence of that data on other artifacts such as trained machine learning models. Moreover, recent research [1, 2] has shown that in some cases it may be possible to infer with high accuracy whether an example was used to train a machine learning model using membership inference attacks (MIAs). This can raise privacy concerns, as it implies that even if an individual's data is deleted from a database, it may still be possible to infer whether that individual's data was used to train a model.

Given the above, machine unlearning is an emergent subfield of machine learning that aims to remove the influence of a specific subset of training examples — the "forget set" — from a trained model. Furthermore, an ideal unlearning algorithm would remove the influence of certain examples while maintaining other beneficial properties, such as the accuracy on the rest of the train set and generalization to held-out examples. A straightforward way to produce this unlearned model is to retrain the model on an adjusted training set that excludes the samples from the forget set. However, this is not always a viable option, as retraining deep models can be computationally expensive. An ideal unlearning algorithm would instead use the already-trained model as a starting point and efficiently make adjustments to remove the influence of the requested data.

Today we're thrilled to announce that we've teamed up with a broad group of academic and industrial researchers to organize the first Machine Unlearning Challenge. The competition considers a realistic scenario in which after training, a certain subset of the training images must be forgotten to protect the privacy or rights of the individuals concerned. The competition will be hosted on Kaggle, and submissions will be automatically scored in terms of both forgetting quality and model utility. We hope that this competition will help advance the state of the art in machine unlearning and encourage the development of efficient, effective and ethical unlearning algorithms.


Machine unlearning applications

Machine unlearning has applications beyond protecting user privacy. For instance, one can use unlearning to erase inaccurate or outdated information from trained models (e.g., due to errors in labeling or changes in the environment) or remove harmful, manipulated, or outlier data.

The field of machine unlearning is related to other areas of machine learning such as differential privacy, life-long learning, and fairness. Differential privacy aims to guarantee that no particular training example has too large an influence on the trained model; a stronger goal compared to that of unlearning, which only requires erasing the influence of the designated forget set. Life-long learning research aims to design models that can learn continuously while maintaining previously-acquired skills. As work on unlearning progresses, it may also open additional ways to boost fairness in models, by correcting unfair biases or disparate treatment of members belonging to different groups (e.g., demographics, age groups, etc.).

Anatomy of unlearning. An unlearning algorithm takes as input a pre-trained model and one or more samples from the train set to unlearn (the "forget set"). From the model, forget set, and retain set, the unlearning algorithm produces an updated model. An ideal unlearning algorithm produces a model that is indistinguishable from the model trained without the forget set.

Challenges of machine unlearning

The problem of unlearning is complex and multifaceted as it involves several conflicting objectives: forgetting the requested data, maintaining the model’s utility (e.g., accuracy on retained and held-out data), and efficiency. Because of this, existing unlearning algorithms make different trade-offs. For example, full retraining achieves successful forgetting without damaging model utility, but with poor efficiency, while adding noise to the weights achieves forgetting at the expense of utility.

Furthermore, the evaluation of forgetting algorithms in the literature has so far been highly inconsistent. While some works report the classification accuracy on the samples to unlearn, others report distance to the fully retrained model, and yet others use the error rate of membership inference attacks as a metric for forgetting quality [4, 5, 6].

We believe that the inconsistency of evaluation metrics and the lack of a standardized protocol is a serious impediment to progress in the field — we are unable to make direct comparisons between different unlearning methods in the literature. This leaves us with a myopic view of the relative merits and drawbacks of different approaches, as well as open challenges and opportunities for developing improved algorithms. To address the issue of inconsistent evaluation and to advance the state of the art in the field of machine unlearning, we've teamed up with a broad group of academic and industrial researchers to organize the first unlearning challenge.


Announcing the first Machine Unlearning Challenge

We are pleased to announce the first Machine Unlearning Challenge, which will be held as part of the NeurIPS 2023 Competition Track. The goal of the competition is twofold. First, by unifying and standardizing the evaluation metrics for unlearning, we hope to identify the strengths and weaknesses of different algorithms through apples-to-apples comparisons. Second, by opening this competition to everyone, we hope to foster novel solutions and shed light on open challenges and opportunities.

The competition will be hosted on Kaggle and run between mid-July 2023 and mid-September 2023. As part of the competition, today we're announcing the availability of the starting kit. This starting kit provides a foundation for participants to build and test their unlearning models on a toy dataset.

The competition considers a realistic scenario in which an age predictor has been trained on face images, and, after training, a certain subset of the training images must be forgotten to protect the privacy or rights of the individuals concerned. For this, we will make available as part of the starting kit a dataset of synthetic faces (samples shown below) and we'll also use several real-face datasets for evaluation of submissions. The participants are asked to submit code that takes as input the trained predictor, the forget and retain sets, and outputs the weights of a predictor that has unlearned the designated forget set. We will evaluate submissions based on both the strength of the forgetting algorithm and model utility. We will also enforce a hard cut-off that rejects unlearning algorithms that run slower than a fraction of the time it takes to retrain. A valuable outcome of this competition will be to characterize the trade-offs of different unlearning algorithms.

Excerpt images from the Face Synthetics dataset together with age annotations. The competition considers the scenario in which an age predictor has been trained on face images like the above, and, after training, a certain subset of the training images must be forgotten.

For evaluating forgetting, we will use tools inspired by MIAs, such as LiRA. MIAs were first developed in the privacy and security literature and their goal is to infer which examples were part of the training set. Intuitively, if unlearning is successful, the unlearned model contains no traces of the forgotten examples, causing MIAs to fail: the attacker would be unable to infer that the forget set was, in fact, part of the original training set. In addition, we will also use statistical tests to quantify how different the distribution of unlearned models (produced by a particular submitted unlearning algorithm) is compared to the distribution of models retrained from scratch. For an ideal unlearning algorithm, these two will be indistinguishable.


Conclusion

Machine unlearning is a powerful tool that has the potential to address several open problems in machine learning. As research in this area continues, we hope to see new methods that are more efficient, effective, and responsible. We are thrilled to have the opportunity via this competition to spark interest in this field, and we are looking forward to sharing our insights and findings with the community.


Acknowledgements

The authors of this post are now part of Google DeepMind. We are writing this blog post on behalf of the organization team of the Unlearning Competition: Eleni Triantafillou*, Fabian Pedregosa* (*equal contribution), Meghdad Kurmanji, Kairan Zhao, Gintare Karolina Dziugaite, Peter Triantafillou, Ioannis Mitliagkas, Vincent Dumoulin, Lisheng Sun Hosoya, Peter Kairouz, Julio C. S. Jacques Junior, Jun Wan, Sergio Escalera and Isabelle Guyon.

Source: Google AI Blog


Differentially private clustering for large-scale datasets

Clustering is a central problem in unsupervised machine learning (ML) with many applications across domains in both industry and academic research more broadly. At its core, clustering consists of the following problem: given a set of data elements, the goal is to partition the data elements into groups such that similar objects are in the same group, while dissimilar objects are in different groups. This problem has been studied in math, computer science, operations research and statistics for more than 60 years in its myriad variants. Two common forms of clustering are metric clustering, in which the elements are points in a metric space, like in the k-means problem, and graph clustering, where the elements are nodes of a graph whose edges represent similarity among them.

In the k-means clustering problem, we are given a set of points in a metric space with the objective to identify k representative points, called centers (here depicted as triangles), so as to minimize the sum of the squared distances from each point to its closest center. Source, rights: CC-BY-SA-4.0

Despite the extensive literature on algorithm design for clustering, few practical works have focused on rigorously protecting the user's privacy during clustering. When clustering is applied to personal data (e.g., the queries a user has made), it is necessary to consider the privacy implications of using a clustering solution in a real system and how much information the output solution reveals about the input data.

To ensure privacy in a rigorous sense, one solution is to develop differentially private (DP) clustering algorithms. These algorithms ensure that the output of the clustering does not reveal private information about a specific data element (e.g., whether a user has made a given query) or sensitive data about the input graph (e.g., a relationship in a social network). Given the importance of privacy protections in unsupervised machine learning, in recent years Google has invested in research on theory and practice of differentially private metric or graph clustering, and differential privacy in a variety of contexts, e.g., heatmaps or tools to design DP algorithms.

Today we are excited to announce two important updates: 1) a new differentially-private algorithm for hierarchical graph clustering, which we’ll be presenting at ICML 2023, and 2) the open-source release of the code of a scalable differentially-private k-means algorithm. This code brings differentially private k-means clustering to large scale datasets using distributed computing. Here, we will also discuss our work on clustering technology for a recent launch in the health domain for informing public health authorities.


Differentially private hierarchical clustering

Hierarchical clustering is a popular clustering approach that consists of recursively partitioning a dataset into clusters at an increasingly finer granularity. A well known example of hierarchical clustering is the phylogenetic tree in biology in which all life on Earth is partitioned into finer and finer groups (e.g., kingdom, phylum, class, order, etc.). A hierarchical clustering algorithm receives as input a graph representing the similarity of entities and learns such recursive partitions in an unsupervised way. Yet at the time of our research no algorithm was known to compute hierarchical clustering of a graph with edge privacy, i.e., preserving the privacy of the vertex interactions.

In “Differentially-Private Hierarchical Clustering with Provable Approximation Guarantees”, we consider how well the problem can be approximated in a DP context and establish firm upper and lower bounds on the privacy guarantee. We design an approximation algorithm (the first of its kind) with a polynomial running time that achieves both an additive error that scales with the number of nodes n (of order n2.5) and a multiplicative approximation of O(log½ n), with the multiplicative error identical to the non-private setting. We further provide a new lower bound on the additive error (of order n2) for any private algorithm (irrespective of its running time) and provide an exponential-time algorithm that matches this lower bound. Moreover, our paper includes a beyond-worst-case analysis focusing on the hierarchical stochastic block model, a standard random graph model that exhibits a natural hierarchical clustering structure, and introduces a private algorithm that returns a solution with an additive cost over the optimum that is negligible for larger and larger graphs, again matching the non-private state-of-the-art approaches. We believe this work expands the understanding of privacy preserving algorithms on graph data and will enable new applications in such settings.


Large-scale differentially private clustering

We now switch gears and discuss our work for metric space clustering. Most prior work in DP metric clustering has focused on improving the approximation guarantees of the algorithms on the k-means objective, leaving scalability questions out of the picture. Indeed, it is not clear how efficient non-private algorithms such as k-means++ or k-means// can be made differentially private without sacrificing drastically either on the approximation guarantees or the scalability. On the other hand, both scalability and privacy are of primary importance at Google. For this reason, we recently published multiple papers that address the problem of designing efficient differentially private algorithms for clustering that can scale to massive datasets. Our goal is, moreover, to offer scalability to large scale input datasets, even when the target number of centers, k, is large.

We work in the massively parallel computation (MPC) model, which is a computation model representative of modern distributed computation architectures. The model consists of several machines, each holding only part of the input data, that work together with the goal of solving a global problem while minimizing the amount of communication between machines. We present a differentially private constant factor approximation algorithm for k-means that only requires a constant number of rounds of synchronization. Our algorithm builds upon our previous work on the problem (with code available here), which was the first differentially-private clustering algorithm with provable approximation guarantees that can work in the MPC model.

The DP constant factor approximation algorithm drastically improves on the previous work using a two phase approach. In an initial phase it computes a crude approximation to “seed” the second phase, which consists of a more sophisticated distributed algorithm. Equipped with the first-step approximation, the second phase relies on results from the Coreset literature to subsample a relevant set of input points and find a good differentially private clustering solution for the input points. We then prove that this solution generalizes with approximately the same guarantee to the entire input.


Vaccination search insights via DP clustering

We then apply these advances in differentially private clustering to real-world applications. One example is our application of our differentially-private clustering solution for publishing COVID vaccine-related queries, while providing strong privacy protections for the users.

The goal of Vaccination Search Insights (VSI) is to help public health decision makers (health authorities, government agencies and nonprofits) identify and respond to communities' information needs regarding COVID vaccines. In order to achieve this, the tool allows users to explore at different geolocation granularities (zip-code, county and state level in the U.S.) the top themes searched by users regarding COVID queries. In particular, the tool visualizes statistics on trending queries rising in interest in a given locale and time.

Screenshot of the output of the tool. Displayed on the left, the top searches related to Covid vaccines during the period Oct 10-16 2022. On the right, the queries that have had rising importance during the same period and compared to the previous week.

To better help identifying the themes of the trending searches, the tool clusters the search queries based on their semantic similarity. This is done by applying a custom-designed k-means–based algorithm run over search data that has been anonymized using the DP Gaussian mechanism to add noise and remove low-count queries (thus resulting in a differentially clustering). The method ensures strong differential privacy guarantees for the protection of the user data.

This tool provided fine-grained data on COVID vaccine perception in the population at unprecedented scales of granularity, something that is especially relevant to understand the needs of the marginalized communities disproportionately affected by COVID. This project highlights the impact of our investment in research in differential privacy, and unsupervised ML methods. We are looking to other important areas where we can apply these clustering techniques to help guide decision making around global health challenges, like search queries on climate change–related challenges such as air quality or extreme heat.


Acknowledgements

We thank our co-authors Silvio Lattanzi, Vahab Mirrokni, Andres Munoz Medina, Shyam Narayanan, David Saulpic, Chris Schwiegelshohn, Sergei Vassilvitskii, Peilin Zhong and our colleagues from the Health AI team that made the VSI launch possible Shailesh Bavadekar, Adam Boulanger, Tague Griffith, Mansi Kansal, Chaitanya Kamath, Akim Kumok, Yael Mayer, Tomer Shekel, Megan Shum, Charlotte Stanton, Mimi Sun, Swapnil Vispute, and Mark Young.

For more information on the Graph Mining team (part of Algorithm and Optimization) visit our pages.

Source: Google AI Blog


More ways we’re making every day safer with Google

Every day, we focus on making sure you’re in control of your data by building products that are secure by default and private by design. At this year’s I/O, we’re introducing new features and technologies to keep you safer with Google


Putting you in control of your data 


Privacy is personal. That's why we make it easy for you to choose the settings that are right for you — whether that’s one place to manage settings in your Google Account, Auto-Delete options, or controls that appear in context when you’re using our products. We announced a number of new controls today: 


  • Quick delete in Search. We’re introducing a new, “quick delete” option to delete the last 15 minutes of your Search history with a single tap from the Google Account Menu. 

  • A passcode protected Locked Folder in Photos. Have you ever handed your phone to show someone a photo, but worried they might scroll to a personal or sensitive image — like a photo of your passport or a surprise gift? “Locked Folder” is a new feature in Google Photos —  a passcode-protected space where select photos can be saved separately. These photos won’t show up as you scroll through your grid or in shared albums. This feature is coming to Google Pixels first, and more Android devices throughout the year.

  • Location History reminders in your Maps Timeline. Now, when you see places you've visited in your Timeline, we'll remind you that it's because you turned on Location History — which you can easily turn off right there in your Timeline. 

1. New “quick delete” option in Search.  2. The new Locked Folder in Photos. 3. Location History reminders in your Maps Timeline. 


We’re also introducing new, industry-leading transparency and permission features on Android 12. The new OS includes a Privacy Dashboard where you will see a timeline of when apps accessed your camera, microphone, or device location. We’ve also added indicators that show when your camera or microphone are in use, as well as easy toggles to disable access to both across your device. And you can now choose to share your approximate location with an app instead of a precise one.  


Building products that are secure by default


As recent high-profile third-party security incidents show, your information isn’t private if it’s not secure. With AI-driven technologies that protect billions of users around the world, our products are secure by default: every day, we block 100 million phishing attempts and 15 billion spam messages in Gmail and encrypt 4 billion photos. And Safe Browsing on Chrome and most other browsers helps keep the rest of the Internet secure, automatically protecting more than 4 billion devices.


One of the biggest security risks is still the continued reliance on passwords — they’re often easy to crack, used across multiple sites, or stolen in phishing attacks. That’s why we’ve been working towards a password-free future — focusing on safer ways to authenticate your identity and building multiple layers of protection into your Google Account, like automatic enrollment in 2-step verification


But because passwords are still required for most online accounts, we’ve also continued to improve our Password Manager, built directly into Chrome, Android and now iOS, to help you create, remember, save and auto-fill passwords across the web. Today, we announced new enhancements to Password Manager:

  • A new tool that makes it easy to import passwords from other password managers

  • Deeper integrations with Chrome and Android to seamlessly fill your passwords across sites and apps, regardless of whether you’re on desktop or on mobile 

  • Password Alerts that automatically warn you if we detect one of your saved passwords has been compromised via a third party breach.

  • A smart way to fix compromised passwords in Chrome with a simple tap. For supported sites and apps, whenever Password Manager finds a password that may have been compromised, you’ll see a "change password" button from Assistant. When you tap the button, the Assistant will not only navigate to the site, but also go through the entire process of changing your password. This feature is available on Android devices and will be rolling out to more sites and apps in the future.


1. A new way to fix compromised passwords in Chrome. 2. A new tool to import passwords from other password managers to Password Manager. 3. Password Alerts. 


Making our products private by design


We’ve pioneered new computing technologies like Federated Learning (invented by Google researchers in 2016) that make it possible to deliver helpful experiences while protecting individual data and privacy. We’ve also led on Differential Privacy, which powers some of our most helpful features and products, from our COVID-19 Community Mobility Reports to traffic predictions in Maps, without revealing individual user data. And this expertise guides our work on broader industry initiatives, like the open-source Privacy Sandbox


Now, we’re continuing that work with Android's Private Compute Core, which keeps your information safe and private for a number of popular AI-driven features like Live Caption (which displays captions based on audio), Now Playing (which tells you the song that’s playing) and Smart Reply (which suggests short responses to messages and emails). For these features, the audio and language processing happens exclusively on your device. Like the rest of Android, Private Compute Core is open source — it’s fully inspectable and verifiable by the security community. 


We’ll continue our work to make every day safer with Google with new controls, advanced security, and privacy-preserving technologies.


Posted by Jen Fitzpatrick, SVP, Core


DEFCON Differential Privacy Training Launch

Differential privacy is a technique that enables organizations to learn from the majority of their data while simultaneously ensuring those results do not allow an individual’s data to be distinguished or re-identified. A popular way of attaining differential privacy is by adding noise to the data, which provides mathematical bounds on the amount of information that is leaked. Our open source offering aims to help developers implement differential privacy.

In the summer of 2019, we publicly launched our Differential Privacy Library. Since then, we’ve expanded it from just C++ to also include Go and Java.

We’ve come to realize that differential privacy requires more than just the library to be effectively implemented. We mentioned in a post earlier this summer that we want all developers to be able to interact with differential privacy, which requires more than an open-sourced library, but rather a training on the topic to share knowledge with all developers.

Our goal with this training is to provide a head start that is helpful for those considering differential privacy implementation. We also want to provide an experience on privacy and security that is understood and impactful to any individual in the field, whether they are a beginner or someone who has background knowledge in privacy.

This new training contains several steps and covers many topics, such as:
  • The foundations of differential privacy
  • Explanations as to why aggregation by itself may not hedge against privacy risks
  • The mathematical behind-the-scenes of noise
  • Tools that can be used in conjunction with differential privacy
  • Codelabs that users can take (in Go)
  • Additional resources to address any further questions

Step 1: Take our survey! It only takes five minutes!

This survey enables us to gain insights into what you are expecting to gain from this training. We are curious about what your objectives and goals are with this training, and if you have any experience with differential privacy.

Step 2: Check-out an introductory video to Differential Privacy!

We introduce topics like data aggregation, k-anonymity, differential privacy, noise, and others. The goal of this module is to introduce the foundations behind the differential privacy, and why it is an important and useful privacy tool.

Step 3: Try-out our codelabs

We have provided Codelabs in Go to help you practice implementing Differential Privacy library end-to-end.

Step 4: Learn more about differential privacy.

We want to offer an additional resource to help answer any questions you may have. If you have other resources that you find, please let us know and we will add these links to our overall training.

Step 5: Provide us with some feedback

Please use this survey as a platform to share your experience with this pilot. Did the content meet your expectations? Did it make sense? What was missing? This is the time for you to share your point of view and any pain points you experienced (as well as any positive aspects you encountered).

We hope this training provides an impactful experience from beginner coders to privacy specialists. The public differential privacy training will launch at the Stanford Biodesign: “Building for Digital Health” Buildathon, Sept 11-13, 2020, led by Stanford, and supported by Google Cloud and Apple Health engineers.

Please continue to reach out to us to share your experiences with us at [email protected]. The suggestions we receive will help us improve and it will inform our thinking as we add new features and updates.

Acknowledgements: Miguel Guevara, Bryant Gipson, Royce Wilson, Kate Frankenberg, Katie Holzheimer, Lior Gottleib, Carmen Bush

By Aditi Joshi – Security and Privacy Engineering, Google Cloud

Expanding our Differential Privacy Library

All developers have a responsibility to treat data with care and respect. Differential privacy helps organizations derive insights from data while simultaneously ensuring that those results do not allow any individual's data to be distinguished or re-identified. This principled approach supports data computation and analysis across many of Google’s core products and features.

Last summer, Google open sourced our foundational differential privacy library so developers and organizations around the world can benefit from this technology. Today, we’re announcing the addition of Go and Java to our library, an end-to-end solution for differential privacy: Privacy on Beam, and new tools to help developers implement this technology effectively.

We’ve listened to feedback from our developer community and, as of today, developers can now perform differentially private analysis in Java and Go. We’re working to bring these two libraries to full feature parity with C++.

We want all developers to have access to differential privacy, regardless of their level of expertise. Our new Privacy on Beam framework captures years of Googler developer experience and efficiency improvements in a comprehensive and easy-to-use solution that handles computation end-to-end. Built on Apache Beam, Privacy on Beam can reduce implementation mistakes, and take care of all the steps that are essential to differential privacy, including noise addition, partition selection, and contribution bounding. If you’re new to Apache Beam or differential privacy, our codelab can get you started.

Tracking privacy budgets is another challenge developers face when implementing differential privacy. So, we’re also releasing a new Privacy Loss Distribution tool for tracking privacy budgets. With this tool, developers can maintain an accurate estimate of the total cost to user privacy for collections of differentially private queries, and better evaluate the overall impact of their pipelines. Privacy Loss Distribution supports widely used mechanisms (such as Laplace, Gaussian, and Randomized response) and can scale to hundreds of compositions.

We hope these new languages, tools, and features unlock differential privacy for even more developers. Continue to share your stories and suggestions with us at [email protected]—your feedback will help inform our future differential privacy launches and updates.

Acknowledgements

Software Engineers: Yurii Sushko, Daniel Simmons-Marengo, Christoph Dibak, Damien Desfontaines, Maria Telyatnikova
Research Scientists: Pasin Manurangsi, Ravi Kumar, Sergei Vassilvitskii, Alex Kulesza, Jenny Gillenwater, Kareem Amin


By: Miguel Guevara, Mirac Vuslat Basaran, Sasha Kulankhina, and Badih Ghazi – Google Privacy Team and Google Research

Enabling Developers and Organizations to Use Differential privacy

Originally posted on the Google Developers Blog
By: Miguel Guevara, Product Manager, Privacy and Data Protection Office


Whether you're a city planner, a small business owner, or a software developer, gaining useful insights from data can help make services work better and answer important questions. But, without strong privacy protections, you risk losing the trust of your citizens, customers, and users.

Differentially-private data analysis is a principled approach that enables organizations to learn from the majority of their data while simultaneously ensuring that those results do not allow any individual's data to be distinguished or re-identified. This type of analysis can be implemented in a wide variety of ways and for many different purposes. For example, if you are a health researcher, you may want to compare the average amount of time patients remain admitted across various hospitals in order to determine if there are differences in care. Differential privacy is a high-assurance, analytic means of ensuring that use cases like this are addressed in a privacy-preserving manner.

Today, we’re rolling out the open-source version of the differential privacy library that helps power some of Google’s core products. To make the library easy for developers to use, we’re focusing on features that can be particularly difficult to execute from scratch, like automatically calculating bounds on user contributions. It is now freely available to any organization or developer that wants to use it.

A deeper look at the technology

Our open source library was designed to meet the needs of developers. In addition to being freely accessible, we wanted it to be easy to deploy and useful. 

Here are some of the key features of the library:
  • Statistical functions: Most common data science operations are supported by this release. Developers can compute counts, sums, averages, medians, and percentiles using our library.
  • Rigorous testing: Getting differential privacy right is challenging. Besides an extensive test suite, we’ve included an extensible ‘Stochastic Differential Privacy Model Checker library’ to help prevent mistakes.
  • Ready to use: The real utility of an open-source release is in answering the question “Can I use this?” That’s why we’ve included a PostgreSQL extension along with common recipes to get you started. We’ve described the details of our approach in a technical paper that we’ve just released today.
  • Modular: We designed the library so that it can be extended to include other functionalities such as additional mechanisms, aggregation functions, or privacy budget management.

Investing in new privacy technologies

We have driven the research and development of practical, differentially-private techniques since we released RAPPOR to help improve Chrome in 2014, and continue to spearhead their real-world application. 

We’ve used differentially private methods to create helpful features in our products, like how busy a business is over the course of a day or how popular a particular restaurant’s dish is in Google Maps, and improve Google Fi.


This year, we’ve announced several open-source, privacy technologies—Tensorflow Privacy, Tensorflow Federated, Private Join and Compute—and today’s launch adds to this growing list. We're excited to make this library broadly available and hope developers will consider leveraging it as they build out their comprehensive data privacy strategies. From medicine, to government, to business, and beyond, it’s our hope that these open-source tools will help produce insights that benefit everyone.

Acknowledgements
Software Engineers: Alain Forget, Bryant Gipson, Celia Zhang, Damien Desfontaines, Daniel Simmons-Marengo, Ian Pudney, Jin Fu, Michael Daub, Priyanka Sehgal, Royce Wilson, William Lam

Enabling developers and organizations to use differential privacy

Posted by Miguel Guevara, Product Manager, Privacy and Data Protection Office

Whether you're a city planner, a small business owner, or a software developer, gaining useful insights from data can help make services work better and answer important questions. But, without strong privacy protections, you risk losing the trust of your citizens, customers, and users.

Differentially-private data analysis is a principled approach that enables organizations to learn from the majority of their data while simultaneously ensuring that those results do not allow any individual's data to be distinguished or re-identified. This type of analysis can be implemented in a wide variety of ways and for many different purposes. For example, if you are a health researcher, you may want to compare the average amount of time patients remain admitted across various hospitals in order to determine if there are differences in care. Differential privacy is a high-assurance, analytic means of ensuring that use cases like this are addressed in a privacy-preserving manner.

Today, we’re rolling out the open-source version of the differential privacy library that helps power some of Google’s core products. To make the library easy for developers to use, we’re focusing on features that can be particularly difficult to execute from scratch, like automatically calculating bounds on user contributions. It is now freely available to any organization or developer that wants to use it.

A deeper look at the technology

Our open source library was designed to meet the needs of developers. In addition to being freely accessible, we wanted it to be easy to deploy and useful.

Here are some of the key features of the library:

  • Statistical functions: Most common data science operations are supported by this release. Developers can compute counts, sums, averages, medians, and percentiles using our library.
  • Rigorous testing: Getting differential privacy right is challenging. Besides an extensive test suite, we’ve included an extensible ‘Stochastic Differential Privacy Model Checker library’ to help prevent mistakes.
  • Ready to use: The real utility of an open-source release is in answering the question “Can I use this?” That’s why we’ve included a PostgreSQL extension along with common recipes to get you started. We’ve described the details of our approach in a technical paper that we’ve just released today.
  • Modular: We designed the library so that it can be extended to include other functionalities such as additional mechanisms, aggregation functions, or privacy budget management.

Investing in new privacy technologies

We have driven the research and development of practical, differentially-private techniques since we released RAPPOR to help improve Chrome in 2014, and continue to spearhead their real-world application.

We’ve used differentially private methods to create helpful features in our products, like how busy a business is over the course of a day or how popular a particular restaurant’s dish is in Google Maps, and improve Google Fi.

Screen recording on phone checking popular times of restaurant

This year, we’ve announced several open-source, privacy technologies—Tensorflow Privacy, Tensorflow Federated, Private Join and Compute—and today’s launch adds to this growing list. We're excited to make this library broadly available and hope developers will consider leveraging it as they build out their comprehensive data privacy strategies. From medicine, to government, to business, and beyond, it’s our hope that these open-source tools will help produce insights that benefit everyone.

Acknowledgements

Software Engineers: Alain Forget, Bryant Gipson, Celia Zhang, Damien Desfontaines, Daniel Simmons-Marengo, Ian Pudney, Jin Fu, Michael Daub, Priyanka Sehgal, Royce Wilson, William Lam