Monthly Archives: February 2022

Define and manage Chat Spaces, space descriptions, and guidelines with a new “manager” role

What’s changing 


We're introducing several improvements for Spaces in Google Chat to help you better organize people, topics, and projects. 

These improvements include: 

  • A Manager role, which gives certain users greater control over the management of a space. 
  • Space descriptions, so users can set context for their spaces. 
  • Space guidelines, to help ensure safe and effective communication environments. 


Who’s impacted 

End users 

Why it’s important 

The Manager role provides tools to promote healthy conversations and control the availability of the space within an organization. Space creators are managers by default and can assign this role to other members of the space as well. 


Space managers will have a badge next to their name on the member list


Managers can also add a description for spaces. This field can be used to describe the purpose of the space, such as “a place to discuss all things asteroids,” which is helpful context for members of the space.

You can add a space description when creating a space or by selecting “View space details” for an existing space on both web and mobile. Space descriptions can be viewed when a user is in the “Browse Spaces” view or by selecting “View space details.”


Adding a space description


Additionally, managers can define space guidelines that establish rules and expectations for members, creating a safer community experience. 


Space guidelines


We hope these features make it easier to share the purpose and guidelines of a particular space, helping your collaborators quickly navigate to the appropriate space. 


Getting started 


Rollout pace 

Space Roles: 

Mobile: 

  • Rapid and Scheduled Release domains: Extended rollout (potentially longer than 15 days for feature visibility) starting on February 28, 2022, with anticipated completion by March 14, 2022 

Web: 

Space Descriptions and Guidelines: 

  • We anticipate rollout for the descriptions and guidelines features to begin later this month. We’ll share an update on the Workspace Updates Blog when rollout begins. 


Availability 

  • Available to all Google Workspace customers, as well as G Suite Basic and Business customers 
  • Available to users with personal Google Accounts 

Resources 

Roadmap 

This feature was listed as an upcoming release.

Chrome Beta for Android Update

Hi everyone! We've just released Chrome Beta 99 (99.0.4844.48) for Android: it's now available on Google Play.

You can see a partial list of the changes in the Git log. For details on new features, check out the Chromium blog, and for details on web platform updates, check here.

If you find a new issue, please let us know by filing a bug.

Ben Mason
Google Chrome

Announcing v202202 of the Google Ad Manager API

We're pleased to announce that v202202 of the Google Ad Manager API is available starting today, February 28th. This release brings a long-awaited feature – Ad Manager hosted video creatives. Simply provide a URL for the video asset, and Ad Manager will read and transcode it.

While API support has been added, this feature is still rolling out to Ad Manager networks and may not yet be available in yours. Networks with this feature disabled will continue to receive CreativeSetError.CANNOT_CREATE_OR_UPDATE_VIDEO_CREATIVES.
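
For illustration, here is a minimal sketch of creating a hosted video creative with the googleads Python client library. The `videoSourceUrl` field name and the placeholder values are assumptions made for this sketch; consult the v202202 CreativeService reference for the authoritative schema.

```python
# Sketch: create an Ad Manager hosted video creative via the googleads
# Python client library. Placeholder values and the 'videoSourceUrl' field
# name are assumptions for illustration.
from googleads import ad_manager

client = ad_manager.AdManagerClient.LoadFromStorage()
creative_service = client.GetService('CreativeService', version='v202202')

video_creative = {
    'xsi_type': 'VideoCreative',
    'name': 'Hosted video creative example',
    'advertiserId': 'INSERT_ADVERTISER_ID_HERE',
    'size': {'width': 640, 'height': 360},
    'duration': 30000,  # milliseconds
    # Point Ad Manager at the raw asset; it handles ingestion and transcoding.
    'videoSourceUrl': 'https://www.example.com/video.mp4',  # assumed field name
}

# On networks where the feature is still disabled, this call fails with
# CreativeSetError.CANNOT_CREATE_OR_UPDATE_VIDEO_CREATIVES.
created = creative_service.createCreatives([video_creative])
print('Created creative with ID %s.' % created[0]['id'])
```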

MCM companies now include information about the status of onboarding tasks and Sellers.json identifiers.

There are also some maintenance updates removing deprecated features, like AdExclusionRuleService and Proposal pause information.

For the full list of changes, check the release notes. Feel free to reach out to us on the Ad Manager API forum with any API-related questions.

Federated Learning with Formal Differential Privacy Guarantees

In 2017, Google introduced federated learning (FL), an approach that enables mobile devices to collaboratively train machine learning (ML) models while keeping the raw training data on each user's device, decoupling the ability to do ML from the need to store the data in the cloud. Since its introduction, Google has continued to actively engage in FL research and deployed FL to power many features in Gboard, including next word prediction, emoji suggestion and out-of-vocabulary word discovery. Federated learning is improving the “Hey Google” detection models in Assistant, suggesting replies in Google Messages, predicting text selections, and more.

While FL allows ML without raw data collection, differential privacy (DP) provides a quantifiable measure of data anonymization, and when applied to ML can address concerns about models memorizing sensitive user data. This too has been a top research priority, and has yielded one of the first production uses of DP for analytics with RAPPOR in 2014, our open-source DP library, Pipeline DP, and TensorFlow Privacy.

Through a multi-year, multi-team effort spanning fundamental research and product integration, today we are excited to announce that we have deployed a production ML model using federated learning with a rigorous differential privacy guarantee. For this proof-of-concept deployment, we utilized the DP-FTRL algorithm to train a recurrent neural network to power next-word-prediction for Spanish-language Gboard users. To our knowledge, this is the first production neural network trained directly on user data announced with a formal DP guarantee (technically ρ=0.81 zero-Concentrated-Differential-Privacy, zCDP, discussed in detail below). Further, the federated approach offers complementary data minimization advantages, and the DP guarantee protects all of the data on each device, not just individual training examples.

Data Minimization and Anonymization in Federated Learning
Along with fundamentals like transparency and consent, the privacy principles of data minimization and anonymization are important in ML applications that involve sensitive data.

Federated learning systems structurally incorporate the principle of data minimization. FL only transmits minimal updates for a specific model training task (focused collection), limits access to data at all stages, processes individuals’ data as early as possible (early aggregation), and discards both collected and processed data as soon as possible (minimal retention).

Another principle that is important for models trained on user data is anonymization, meaning that the final model should not memorize information unique to a particular individual's data, e.g., phone numbers, addresses, credit card numbers. However, FL on its own does not directly tackle this problem.

The mathematical concept of DP allows one to formally quantify this principle of anonymization. Differentially private training algorithms add random noise during training to produce a probability distribution over output models, and ensure that this distribution doesn't change too much given a small change to the training data; ρ-zCDP quantifies how much the distribution could possibly change. When the guarantee covers adding or removing a single training example, we call this example-level DP.
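
For reference, zCDP has a compact formal definition in terms of Rényi divergences (Bun and Steinke, 2016): a mechanism M satisfies ρ-zCDP if, for all pairs of neighboring datasets D and D′,

```latex
D_{\alpha}\!\left( M(D) \,\|\, M(D') \right) \le \rho \, \alpha
\quad \text{for all } \alpha \in (1, \infty),
```

where $D_{\alpha}$ denotes the Rényi divergence of order α. Smaller ρ forces the output distributions on neighboring datasets to be closer, i.e., gives better privacy.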

Showing that deep learning with example-level differential privacy was even possible in the simpler setting of centralized training was a major step forward in 2016. Achieved by the DP-SGD algorithm, the key was amplifying the privacy guarantee by leveraging the randomness in sampling training examples ("amplification-via-sampling").

However, when users can contribute multiple examples to the training dataset, example-level DP is not necessarily strong enough to ensure the users’ data isn't memorized. Instead, we have designed algorithms for user-level DP, which requires that the output distribution of models doesn't change even if we add/remove all of the training examples from any one user (or all the examples from any one device in our application). Fortunately, because FL summarizes all of a user's training data as a single model update, federated algorithms are well-suited to offering user-level DP guarantees.

Both limiting the contributions from one user and adding noise can come at the expense of model accuracy, however, so maintaining model quality while also providing strong DP guarantees is a key research focus.

The Challenging Path to Federated Learning with Differential Privacy
In 2018, we introduced the DP-FedAvg algorithm, which extended the DP-SGD approach to the federated setting with user-level DP guarantees, and in 2020 we deployed this algorithm to mobile devices for the first time. This approach ensures the training mechanism is not too sensitive to any one user's data, and empirical privacy auditing techniques rule out some forms of memorization.

However, while the amplification-via-sampling argument is essential to providing a strong DP guarantee for DP-FedAvg, in a real-world cross-device FL system it would be complex and hard to verify that devices are subsampled precisely and uniformly at random from a large population. One challenge is that devices choose when to connect (or "check in") based on many external factors (e.g., requiring the device is idle, on unmetered WiFi, and charging), and the number of available devices can vary substantially.

Achieving a formal privacy guarantee requires a protocol that does all of the following:

  • Makes progress on training even as the set of devices available varies significantly with time.
  • Maintains privacy guarantees even in the face of unexpected or arbitrary changes in device availability.
  • For efficiency, allows client devices to locally decide whether they will check in to the server in order to participate in training, independent of other devices.

Initial work on privacy amplification via random check-ins highlighted these challenges and introduced a feasible protocol, but it would have required complex changes to our production infrastructure to deploy. Further, as with the amplification-via-sampling analysis of DP-SGD, the privacy amplification possible with random check-ins depends on a large number of devices being available. For example, if only 1000 devices are available for training, and participation of at least 1000 devices is needed in each training step, that requires either 1) including all devices currently available and paying a large privacy cost since there is no randomness in the selection, or 2) pausing the protocol and not making progress until more devices are available.

Achieving Provable Differential Privacy for Federated Learning with DP-FTRL
To address this challenge, the DP-FTRL algorithm is built on two key observations: 1) the convergence of gradient-descent-style algorithms depends primarily not on the accuracy of individual gradients, but the accuracy of cumulative sums of gradients; and 2) we can provide accurate estimates of cumulative sums with a strong DP guarantee by utilizing negatively correlated noise, added by the aggregating server: essentially, adding noise to one gradient and subtracting that same noise from a later gradient. DP-FTRL accomplishes this efficiently using the Tree Aggregation algorithm [1, 2].
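
To make the cumulative-sum idea concrete, here is a toy NumPy sketch of tree aggregation for private prefix sums. It illustrates the mechanism only; the production implementation lives in TensorFlow Privacy, and the function and variable names here are ours.

```python
import numpy as np

def private_prefix_sums(grads, noise_std, rng=None):
    """Toy tree-aggregation sketch for DP estimates of cumulative gradient
    sums. Each binary-tree node covers a dyadic interval of rounds and holds
    that interval's gradient sum plus independent Gaussian noise; the prefix
    sum through round t is then assembled from O(log T) noisy nodes."""
    rng = rng or np.random.default_rng(0)
    dim = grads[0].shape
    noisy_node = {}  # (start, length) -> noisy sum over grads[start:start+length]

    def node(start, length):
        if (start, length) not in noisy_node:
            true_sum = np.sum(grads[start:start + length], axis=0)
            noisy_node[(start, length)] = true_sum + rng.normal(0.0, noise_std, dim)
        return noisy_node[(start, length)]

    estimates = []
    for t in range(1, len(grads) + 1):
        # Decompose [0, t) into dyadic intervals via t's binary representation.
        total, start, remaining = np.zeros(dim), 0, t
        while remaining > 0:
            length = 1 << (remaining.bit_length() - 1)  # largest power of 2 <= remaining
            total = total + node(start, length)
            start, remaining = start + length, remaining - length
        estimates.append(total)
    return estimates
```

Because each prefix sum touches only O(log T) noisy nodes, the error in the estimates grows polylogarithmically in the number of rounds, rather than linearly as it would if each gradient were noised independently and then summed.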

The graphic below illustrates how estimating cumulative sums rather than individual gradients can help. We look at how the noise introduced by DP-FTRL and DP-SGD influence model training, compared to the true gradients (without added noise; in black) which step one unit to the right on each iteration. The individual DP-FTRL gradient estimates (blue), based on cumulative sums, have larger mean-squared-error than the individually-noised DP-SGD estimates (orange), but because the DP-FTRL noise is negatively correlated, some of it cancels out from step to step, and the overall learning trajectory stays closer to the true gradient descent steps.

To provide a strong privacy guarantee, we limit the number of times a user contributes an update. Fortunately, sampling-without-replacement is relatively easy to implement in production FL infrastructure: each device can remember locally which models it has contributed to in the past, and choose to not connect to the server for any later rounds for those models.
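
A minimal sketch of that device-local bookkeeping follows, with illustrative names (the production client code is more involved):

```python
# Each device remembers locally which model versions it has contributed to,
# and declines to check in again for those, independent of other devices.
contributed_models = set()  # IDs of training tasks this device has served

def should_check_in(model_id):
    """Local participation decision, requiring no coordination with peers."""
    return model_id not in contributed_models

def record_contribution(model_id):
    contributed_models.add(model_id)
```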

Production Training Details and Formal DP Statements
For the production DP-FTRL deployment introduced above, each eligible device maintains a local training cache consisting of user keyboard input, and when participating computes an update to the model which makes it more likely to suggest the next word the user actually typed, based on what has been typed so far. We ran DP-FTRL on this data to train a recurrent neural network with ~1.3M parameters. Training ran for 2000 rounds over six days, with 6500 devices participating per round. To allow for the DP guarantee, devices participated in training at most once every 24 hours. Model quality improved over the previous DP-FedAvg trained model, which offered empirically-tested privacy advantages over non-DP models, but lacked a meaningful formal DP guarantee.

The training mechanism we used is available in open-source in TensorFlow Federated and TensorFlow Privacy, and with the parameters used in our production deployment it provides a meaningfully strong privacy guarantee. Our analysis gives ρ=0.81 zCDP at the user level (treating all the data on each device as a different user), where smaller numbers correspond to better privacy in a mathematically precise way. As a comparison, this is stronger than the ρ=2.63 zCDP guarantee chosen by the 2020 US Census.
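
To relate this to the more familiar (ε, δ)-DP notion: any ρ-zCDP guarantee can be converted, for every δ > 0, via the standard bound from Bun and Steinke (2016):

```latex
\rho\text{-zCDP} \;\Longrightarrow\; (\varepsilon, \delta)\text{-DP}
\quad \text{with} \quad
\varepsilon = \rho + 2\sqrt{\rho \log(1/\delta)} .
```

As illustrative arithmetic, ρ = 0.81 with δ = 10⁻¹⁰ gives ε ≈ 9.4.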

Next Steps
While we have reached the milestone of deploying a production FL model using a mechanism that provides a meaningfully small zCDP, our research journey continues. We are still far from being able to say this approach is possible (let alone practical) for most ML models or product applications, and other approaches to private ML exist. For example, membership inference tests and other empirical privacy auditing techniques can provide complementary safeguards against leakage of users’ data. Most importantly, we see training models with user-level DP with even a very large zCDP as a substantial step forward, because it requires training with a DP mechanism that bounds the sensitivity of the model to any one user's data. Further, it smooths the road to later training models with improved privacy guarantees as better algorithms or more data become available. We are excited to continue the journey toward maximizing the value that ML can deliver while minimizing potential privacy costs to those who contribute training data.

Acknowledgements
The authors would like to thank Alex Ingerman and Om Thakkar for significant impact on the blog post itself, as well as the teams at Google that helped develop these ideas and bring them to practice:

  • Core research team: Galen Andrew, Borja Balle, Peter Kairouz, Daniel Ramage, Shuang Song, Thomas Steinke, Andreas Terzis, Om Thakkar, Zheng Xu
  • FL infrastructure team: Katharine Daly, Stefan Dierauf, Hubert Eichner, Igor Pisarev, Timon Van Overveldt, Chunxiang Zheng
  • Gboard team: Angana Ghosh, Xu Liu, Yuanbo Zhang
  • Speech team: Françoise Beaufays, Mingqing Chen, Rajiv Mathews, Vidush Mukund, Igor Pisarev, Swaroop Ramaswamy, Dan Zivkovic

Source: Google AI Blog


TAG Bulletin: Q1 2022

This bulletin includes coordinated influence operation campaigns terminated on our platforms in Q1 2022. It was last updated on February 28, 2022.

January

  • We terminated 3 YouTube channels as part of our investigation into coordinated influence operations. The campaign uploaded content in Arabic that was critical of former Sudanese president Omar al-Bashir and supportive of the 2019 Sudanese coup d’état. Our findings are similar to findings reported by Meta.
  • We terminated 1 AdSense account and 1 Play developer as part of our investigation into coordinated influence operations linked to Turkey. The campaign was sharing content in Arabic that was about news and current events in Libya. Our findings are similar to findings reported by Meta.
  • We terminated 42 YouTube channels and 2 Ads accounts as part of our investigation into coordinated influence operations linked to Iraq. The campaign uploaded content in Arabic that was in support of the Iraqi Harakat Hoquq party. We received leads from Mandiant that supported us in this investigation.
  • We terminated 4 YouTube channels, 2 AdSense accounts, and 1 Blogger blog and blocked 6 domains from eligibility to appear on Google News surfaces and Discover as part of our investigation into reported coordinated influence operations linked to Belarus, Moldova, and Ukraine. The campaign was sharing content in English that was about a variety of topics including US and European current events. We believe this operation was financially motivated.
  • We terminated 4361 YouTube channels as part of our ongoing investigation into coordinated influence operations linked to China. These channels mostly uploaded spammy content in Chinese about music, entertainment, and lifestyle. A very small subset uploaded content in Chinese and English about China and U.S. foreign affairs. These findings are consistent with our previous reports.

New intelligent, content based detection and additional regional security detectors for data loss prevention

What’s changing 

We’re adding 40+ content detectors, which expand the type of content that data loss prevention (DLP) in Drive can scan and detect. 

New intelligent, machine learning based detectors for content inspection of documents, such as: 

  • SEC filings 
  • Legal briefs and court orders 
  • Tax documents 
  • Contracts 
  • Patents 
  • Resumes 
  • Finance forms 
  • Source code, system logs, and more. 

These machine learning-based detectors are pre-trained to automatically detect sensitive content, requiring no additional work on the part of the admin. 

Additionally, we’ve added over forty new detectors for regional security, such as: 

  • Auth token 
  • API Keys 
  • Belgium ID 
  • Global VIN
  • Germany TIN 
  • India GST and more.

Visit the Help Center for a complete list of predefined detectors for data loss prevention (DLP) in Google Drive. 


Adding conditions to define the data that you want to scan for


Who’s impacted 

Admins 

Why it’s important 

Admins can use data loss prevention to create and apply rules that control what content users can share outside your organization in Google Drive files, helping to prevent unintended exposure of sensitive information. 

These additional detectors, along with intelligent scanning, help to further secure your environment and sensitive data. Administrators can enforce policies based on these intelligent detectors to restrict external sharing, apply classification labels, prevent uploads, or warn users. 

Getting started 

  • Admins: This feature can be configured at the domain, OU, or group level within the DLP system at Admin console > Security > Data Protection. Use our Help Center to learn more about creating DLP for Drive rules and custom content detectors and using predefined content detectors. 
  • End users: No action required. 

Rollout pace 

Availability 

  • Available to Google Workspace Enterprise Standard, Enterprise Plus, Education Fundamentals, Education Standard, Education Plus, the Teaching and Learning Upgrade, as well as Cloud Identity Premium customers. 
  • Not available to Google Workspace Essentials, Business Starter, Business Standard, Business Plus, Enterprise Essentials, Frontline, and Nonprofits, legacy G Suite Basic and Business customers, and Cloud Identity Free customers. 

Resources 

Important changes to placement reporting for App Campaigns

On January 5, 2022, we removed all App campaign placement data from the following reports:

  • Google Ads API 
  • AdWords API / Google Ads scripts 

We made this change because the data provided didn’t fully represent the complete view of the placements that help developers monitor brand safety for their advertisers. If you use these reports, see the App Campaigns Brand Safety Placement report in the Google Ads UI.

If you have AdWords API or Google Ads API related questions about this change, please reach out to us on the API forum or at [email protected]. Note: AdWords API developers must migrate to the Google Ads API by April 27, 2022.

If you have any Google Ads scripts related questions, please reach out to us on the scripts forum.

Students in LATAM come together for continent-wide tech conference

Posted by Paco Solsona, Regional Lead LATAM



A continental community of coders

Growing up, many students across Latin America watched eagerly as the technology in their cities became more advanced and opportunities to create the future expanded. For some, computers and web technologies presented untold potential. Still, excitement about doing right by their communities was at the heart of it all. Now, a forward-looking group of university students from 27 different Latin American nations and Google Developer Student Clubs (GDSC) has formed a continent-wide network to chart a course forward for their continent. They are building a community of Spanish-speaking Latin American student developers who support each other, foster leadership skills, and bring more opportunities to student developers in the region.

Teaming up to build skills and teach other student developers

In November 2021, this regional coalition of students came together to host a continent-wide LATAM conference, a two-day student conference (the team planned and executed it in just two weeks). The event featured ten speakers from Spanish-speaking Latin American countries and taught students about different developer technologies. Attendees learned about machine learning, automating processes using data pipelines, leveraging React to deploy landing pages to Firebase, and building mobile applications with Firebase and React Native. Over the two days, 300 people attended the conference, and the recordings have attracted hundreds of views on YouTube.

Screenshot of a group of GDSC leads video chatting during a live event

“We’re coming from a less developed region. We grew up seeing other countries that were more technologically advanced. Now, developers from Latin America are more confident that they have the skills to implement projects, produce new things, and bring advancement to the continent.” - Maria Agustina Cuello (Chichi)

Working together with purpose

Through working together on the conference, the organizers of the LATAM conference know Latin American youth have a bright future. They are excited by the opportunity to use the power of technology and connectivity to change the world.

Screenshot of a group of women GDSC leads video chatting during a live event

Luis Eduardo, Lead GDSC UTP (Perú), says it felt amazing to be part of the LATAM conference: “being able to meet students from other countries with the same desire to work for the community was wonderful. Knowing that, despite being thousands of miles away, there was no impediment to being able to work as an organized team. This is what makes this family unique.”

Screenshot of a group of GDSC members video chatting during a live event

“LATAM conference was the opportunity to show that wherever we are, we can help others, and you will always find people with similar ideas,” says Francisco Imanol Suarez, Lead GDSC UNPSJB (Argentina).

Solution Challenge preparations

The group is now hosting events to teach student developers new skills and prepare them for the 2022 Solution Challenge, a global contest where students from around the world are invited to solve for one of the United Nations' Sustainable Development Goals using Google technologies.

In preparing their communities to build projects, the group plans to activate the countries and regions in Latin America. The students aim to expose each other to multiple technologies in the field and plan to host theme weeks for the Solution Challenge, like a Firebase week, a UX/UI week, and a Flutter Festival.

Students across the GDSC LATAM chapters are forming teams for the Solution Challenge. Some are local, coming from a single university, while others are broader, like students in Argentina working with students from Mexico. “A few months ago, no one knew how many people we would help take their first steps in the world of development. Let's hope this community continues to grow to be able to show that amazing things can be done in LATAM,” says Luis Eduardo, Lead GDSC UTP (Perú).

Screenshot of a GDSC student giving a presentation on Google technology via video chat

“I’m grateful to be part of this community and work with amazing team members who are so eager to work together and do activities. We want to bring all the opportunities we can to Latin American students, and gender and language are not a barrier,” says Cuello.

What’s next for GDSC LATAM

The members of GDSC LATAM plan to continue hosting collaborative events for the community, such as a Google Cloud Machine Learning bootcamp, a hackathon, and a 2022 student conference, as well as related events with other student communities. The group holds Android and Google Cloud Platform (GCP) study jams, publishes a podcast, and hosts networking events to help reach more students, create networking opportunities, and expand each university’s GDSC. Eventually, they hope to positively impact the region by encouraging budding developers to build new technologies in Latin America.

If this inspires you, sign up for the Solution Challenge and submit a project by March 31, 2022 at goo.gle/solutionchallenge and join a Google Developer Student Club at your college or university.

Check out GDSC LATAM on social media: Twitter | FB | YouTube Channel | Instagram


Constrained Reweighting for Training Deep Neural Nets with Noisy Labels

Over the past several years, deep neural networks (DNNs) have been quite successful in driving impressive performance gains in several real-world applications, from image recognition to genomics. However, modern DNNs often have far more trainable model parameters than the number of training examples and the resulting overparameterized networks can easily overfit to noisy or corrupted labels (i.e., examples that are assigned a wrong class label). As a consequence, training with noisy labels often leads to degradation in accuracy of the trained model on clean test data. Unfortunately, noisy labels can appear in several real-world scenarios due to multiple factors, such as errors and inconsistencies in manual annotation and the use of inherently noisy label sources (e.g., the internet or automated labels from an existing system).

Earlier work has shown that representations learned by pre-training large models with noisy data can be useful for prediction when used in a linear classifier trained with clean data. In principle, it is possible to directly train machine learning (ML) models on noisy data without resorting to this two-stage approach. To be successful, such alternative methods should have the following properties: (i) they should fit easily into standard training pipelines with little computational or memory overhead; (ii) they should be applicable in “streaming” settings where new data is continuously added during training; and (iii) they should not require data with clean labels.

In “Constrained Instance and Class Reweighting for Robust Learning under Label Noise”, we propose a novel and principled method, named Constrained Instance reWeighting (CIW), with these properties that works by dynamically assigning importance weights both to individual instances and to class labels in a mini-batch, with the goal of reducing the effect of potentially noisy examples. We formulate a family of constrained optimization problems that yield simple solutions for these importance weights. These optimization problems are solved per mini-batch, which avoids the need to store and update the importance weights over the full dataset. This optimization framework also provides a theoretical perspective for existing label smoothing heuristics that address label noise, such as label bootstrapping. We evaluate the method with varying amounts of synthetic noise on the standard CIFAR-10 and CIFAR-100 benchmarks and observe considerable performance gains over several existing methods.

Method
Training ML models involves minimizing a loss function that indicates how well the current parameters fit the given training data. In each training step, this loss is approximately calculated as a (weighted) sum of the losses of individual instances in the mini-batch of data on which it is operating. In standard training, each instance is treated equally for the purpose of updating the model parameters, which corresponds to assigning uniform (i.e., equal) weights across the mini-batch.
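
In symbols, with per-instance losses ℓᵢ over a mini-batch of size B, standard training and a reweighted objective differ only in the weights:

```latex
\mathcal{L}_{\text{uniform}} = \frac{1}{B} \sum_{i=1}^{B} \ell_i
\qquad \text{vs.} \qquad
\mathcal{L}_{w} = \sum_{i=1}^{B} w_i \, \ell_i ,
\quad w_i \ge 0 , \;\; \sum_{i=1}^{B} w_i = 1 .
```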

However, empirical observations made in earlier works reveal that noisy or mislabeled instances tend to have higher loss values than those that are clean, particularly during early to mid-stages of training. Thus, assigning uniform importance weights to all instances means that due to their higher loss values, the noisy instances can potentially dominate the clean instances and degrade the accuracy on clean test data.

Motivated by these observations, we propose a family of constrained optimization problems that solve this problem by assigning importance weights to individual instances in the dataset to reduce the effect of those that are likely to be noisy. This approach provides control over how much the weights deviate from uniform, as quantified by a divergence measure. It turns out that for several types of divergence measures, one can obtain simple formulae for the instance weights. The final loss is computed as the weighted sum of individual instance losses, which is used for updating the model parameters. We call this the Constrained Instance reWeighting (CIW) method. This method allows for controlling the smoothness or peakiness of the weights through the choice of divergence and a corresponding hyperparameter.
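
As one concrete member of this family: if the divergence keeping the weights close to uniform is the KL divergence, the optimal mini-batch weights take a softmax form over the negative losses. The sketch below assumes that case; it is illustrative, not the released code.

```python
import numpy as np

def ciw_weights(losses, temperature):
    """Sketch of constrained instance reweighting for one mini-batch.
    Minimizing sum(w * losses) while penalizing the KL divergence of w from
    uniform yields a softmax over the negative losses: high-loss (likely
    noisy) instances are damped, low-loss (likely clean) ones boosted.
    `temperature` stands in for the divergence hyperparameter; other
    divergence choices give other closed forms. Illustrative only."""
    logits = -np.asarray(losses, dtype=np.float64) / temperature
    logits -= logits.max()          # for numerical stability
    w = np.exp(logits)
    return w / w.sum()              # mini-batch weights sum to 1

# Example: the instance with an outlying loss is heavily down-weighted.
print(ciw_weights([0.3, 0.2, 2.5, 0.4], temperature=0.5))
# -> approximately [0.33, 0.40, 0.004, 0.27]
```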

Schematic of the proposed Constrained Instance reWeighting (CIW) method.

Illustration with Decision Boundary on a 2D Dataset
As an example to illustrate the behavior of this method, we consider a noisy version of the Two Moons dataset, which consists of randomly sampled points from two classes in the shape of two half moons. We corrupt 30% of the labels and train a multilayer perceptron network on it for binary classification. We use the standard binary cross-entropy loss and an SGD with momentum optimizer to train the model. In the figure below (left panel), we show the data points and visualize an acceptable decision boundary separating the two classes with a dotted line. The points marked red in the upper half-moon and those marked green in the lower half-moon indicate noisy data points.
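
A baseline version of this setup is easy to reproduce with scikit-learn; the network width and the noise parameters below are assumptions, since the post does not specify them.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)

# Corrupt 30% of the labels by flipping them to the other class.
flip = rng.random(len(y)) < 0.3
y_noisy = np.where(flip, 1 - y, y)

# Baseline: standard cross-entropy training with SGD + momentum and uniform
# instance weights, which eventually overfits the noisy labels.
mlp = MLPClassifier(hidden_layer_sizes=(32, 32), solver='sgd', momentum=0.9,
                    max_iter=2000, random_state=0)
mlp.fit(X, y_noisy)
print('accuracy against the clean labels:', mlp.score(X, y))
```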

The baseline model trained with the binary cross-entropy loss assigns uniform weights to the instances in each mini-batch, thus eventually overfitting to the noisy instances and resulting in a poor decision boundary (middle panel in the figure below).

The CIW method reweights the instances in each mini-batch based on their corresponding loss values (right panel in the figure below). It assigns larger weights to the clean instances that are located on the correct side of the decision boundary and damps the effect of noisy instances that incur a higher loss value. Smaller weights for noisy instances help in preventing the model from overfitting to them, thus allowing the model trained with CIW to successfully converge to a good decision boundary by avoiding the impact of label noise.

Illustration of decision boundary as the training proceeds for the baseline and the proposed CIW method on the Two Moons dataset. Left: Noisy dataset with a desirable decision boundary. Middle: Decision boundary for standard training with cross-entropy loss. Right: Training with the CIW method. The size of the dots in (middle) and (right) are proportional to the importance weights assigned to these examples in the minibatch.

Constrained Class reWeighting
Instance reweighting assigns lower weights to instances with higher losses. We further extend this intuition to assign importance weights over all possible class labels. Standard training uses a one-hot label vector as the class weights, assigning a weight of 1 to the labeled class and 0 to all other classes. However, for the potentially mislabeled instances, it is reasonable to assign non-zero weights to classes that could be the true label. We obtain these class weights as solutions to a family of constrained optimization problems where the deviation of the class weights from the label one-hot distribution, as measured by a divergence of choice, is controlled by a hyperparameter.

Again, for several divergence measures, we can obtain simple formulae for the class weights. We refer to this as Constrained Instance and Class reWeighting (CICW). The solution to this optimization problem also recovers the earlier proposed methods based on static label bootstrapping (also referred to as label smoothing) when the divergence is taken to be the total variation distance. This provides a theoretical perspective on the popular method of static label bootstrapping.
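
As an illustration of that special case: static label bootstrapping replaces the one-hot training target with a convex mix of the label and another reference distribution (the model's own predictions below; classic label smoothing mixes with the uniform distribution instead). A minimal sketch, with `beta` as an assumed name for the mixing coefficient:

```python
import numpy as np

def bootstrapped_targets(one_hot, probs, beta):
    """Static label bootstrapping: the training target is a convex mix of
    the (possibly noisy) one-hot label and another distribution, here the
    model's predicted probabilities. `beta` controls how far the class
    weights may move off the one-hot label. Illustrative sketch only."""
    return beta * one_hot + (1.0 - beta) * probs

one_hot = np.array([0.0, 1.0, 0.0])   # the given (possibly wrong) label
probs = np.array([0.7, 0.2, 0.1])     # the model finds class 0 more plausible
print(bootstrapped_targets(one_hot, probs, beta=0.8))
# -> [0.14 0.84 0.02]: some target mass shifts toward the plausible class
```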

Using Instance Weights with Mixup
We also propose a way to use the obtained instance weights with mixup, which is a popular method for regularizing models and improving prediction performance. It works by sampling a pair of examples from the original dataset and generating a new artificial example using a random convex combination of these. The model is trained by minimizing the loss on these mixed-up data points. Vanilla mixup is oblivious to the individual instance losses, which might be problematic for noisy data because mixup will treat clean and noisy examples equally. Since a high instance weight obtained with our CIW method is more likely to indicate a clean example, we use our instance weights to do a biased sampling for mixup and also use the weights in convex combinations (instead of random convex combinations in vanilla mixup). This results in biasing the mixed-up examples towards clean data points, which we refer to as CICW-Mixup.
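
A hedged sketch of that biased mixup, assuming one-hot label vectors and using each pair's normalized weights as the mixing coefficient as just described; the exact sampling and mixing details in the paper may differ.

```python
import numpy as np

def cicw_mixup_batch(x, y, weights, rng=None):
    """Sketch of weight-aware mixup: sample each instance's partner with
    probability proportional to the CIW instance weights (biasing pairs
    toward likely-clean points), and mix each pair using the pair's
    normalized weights instead of a random Beta draw. Assumes x has shape
    (n, ...) and y holds one-hot labels of shape (n, num_classes)."""
    rng = rng or np.random.default_rng(0)
    n = len(x)
    partner = rng.choice(n, size=n, p=weights / weights.sum())
    lam = weights / (weights + weights[partner])  # in (0, 1); favors the cleaner point
    lam_x = lam.reshape(-1, *([1] * (x.ndim - 1)))
    x_mix = lam_x * x + (1.0 - lam_x) * x[partner]
    y_mix = lam[:, None] * y + (1.0 - lam[:, None]) * y[partner]
    return x_mix, y_mix
```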

We apply these methods with varying amounts of synthetic noise (i.e., the label for each instance is randomly flipped to other labels) on the standard CIFAR-10 and CIFAR-100 benchmark datasets. We show the test accuracy on clean data with symmetric synthetic noise where the noise rate is varied between 0.2 and 0.8.

We observe that the proposed CICW outperforms several methods and matches the results of dynamic mixup, which maintains the importance weights over the full training set with mixup. Using our importance weights with mixup in CICW-Mixup resulted in significantly improved performance over these methods, particularly for larger noise rates (as shown by the lines above and to the right in the graphs below).

Test accuracy on clean data while varying the amount of symmetric synthetic noise in the training data for CIFAR-10 and CIFAR-100. Methods compared are: standard Cross-Entropy Loss (CE), Bi-tempered Loss, Active-Passive Normalized Loss, the proposed CICW, Mixup, Dynamic Mixup, and the proposed CICW-Mixup.

Summary and Future Directions
We formulate a novel family of constrained optimization problems for tackling label noise that yield simple mathematical formulae for reweighting the training instances and class labels. These formulations also provide a theoretical perspective on existing label smoothing–based methods for learning with noisy labels. We also propose ways of using the instance weights with mixup that result in further significant performance gains over instance and class reweighting alone. Our method operates solely at the level of mini-batches, which avoids the extra overhead of maintaining dataset-level weights as in some recent methods.

As a direction for future work, we would like to evaluate the method on the kinds of realistic noisy labels encountered in large-scale practical settings. We also believe that studying the interaction of our framework with label smoothing is an interesting direction that could result in a loss-adaptive version of label smoothing. We are also excited to release the code for CICW, now available on GitHub.

Acknowledgements
We'd like to thank Kevin Murphy for providing constructive feedback during the course of the project.

Source: Google AI Blog