Tag Archives: ML Fairness

Introducing the Model Card Toolkit for Easier Model Transparency Reporting

Posted by Huanming Fang and Hui Miao, Software Engineers, Google Research

Machine learning (ML) model transparency is important across a wide variety of domains that impact peoples’ lives, from healthcare to personal finance to employment. The information needed by downstream users will vary, as will the details that developers need in order to decide whether or not a model is appropriate for their use case. This desire for transparency led us to develop a new tool for model transparency, Model Cards, which provide a structured framework for reporting on ML model provenance, usage, and ethics-informed evaluation and give a detailed overview of a model’s suggested uses and limitations that can benefit developers, regulators, and downstream users alike.

Over the past year, we’ve launched Model Cards publicly and worked to create Model Cards for open-source models released by teams across Google. For example, the MediaPipe team creates state-of-the-art computer vision models for a number of common tasks, and has included Model Cards for each of their open-source models in their GitHub repository. Creating Model Cards like these takes substantial time and effort, often requiring a detailed evaluation and analysis of both data and model performance. In many cases, one needs to additionally evaluate how a model performs on different subsets of data, noting any areas where the model underperforms. Further, Model Card creators may want to report on the model’s intended uses and limitations, as well as any ethical considerations potential users might find useful, compiling and presenting the information in a format that’s accessible and understandable.

To streamline the creation of Model Cards for all ML practitioners, we are sharing the Model Card Toolkit (MCT), a collection of tools that support developers in compiling the information that goes into a Model Card and that aid in the creation of interfaces that will be useful for different audiences. To demonstrate how the MCT can be used in practice, we have also released a Colab tutorial that builds a Model Card for a simple classification model trained on the UCI Census Income dataset.

Introducing the MCT
To guide the Model Card creator to organize model information, we provide a JSON schema, which specifies the fields to include in the Model Card. Using the model provenance information stored with ML Metadata (MLMD), the MCT automatically populates the JSON with relevant information, such as class distributions in the data and model performance statistics. We also provide a ModelCard data API to represent an instance of the JSON schema and visualize it as a Model Card. The Model Card creator can choose which metrics and graphs to display in the final Model Card, including metrics that highlight areas where the model’s performance might deviate from its overall performance.

Once the MCT has populated the Model Card with key metrics and graphs, the Model Card creator can supplement this with information regarding the model’s intended usage, limitations, trade-offs, and any other ethical considerations that would otherwise be unknown to people using the model. If a model underperforms for certain slices of data, the limitations section would be another place to acknowledge this, along with suggested mitigation strategies to help developers address these issues. This type of information is critical in helping developers decide whether or not a model is suitable for their use case, and helps Model Card creators provide context so that their models are used appropriately. Right now, we’re providing one UI template to visualize the Model Card, but you can create different templates in HTML should you want to visualize the information in other formats.

Currently, the MCT is available to anyone using TensorFlow Extended (TFX) in open source or on Google Cloud Platform. Users who are not serving their ML models via TFX can still leverage the JSON schema and the methods to visualize via the HTML template.

Here is an example of the completed Model Card from the Colab tutorial, which leverages the MCT and the provided UI template.

Conclusion
Currently, the MCT includes a standard template for reporting on ML models broadly, but we’re continuing to create UI templates for more specific applications of ML. If you’d like to join the conversation about what fields are important and how best to leverage the MCT for different use cases, you can get started here or with the Colab tutorial. Let us know how you’ve leveraged the MCT for your use case by emailing us at [email protected]. You can learn more about Google’s efforts to promote responsible AI in the TensorFlow ecosystem on our TensorFlow Responsible AI page.

Acknowledgements
Huanming Fang, Hui Miao, Karan Shukla, Dan Nanas, Catherina Xu, Christina Greer, Tulsee Doshi, Tiffany Deng, Margaret Mitchell, Timnit Gebru, Andrew Zaldivar, Mahima Pushkarna, Meena Natarajan, Roy Kim, Parker Barnes, Tom Murray, Susanna Ricco, Lucy Vasserman, and Simone Wu

Source: Google AI Blog

Setting Fairness Goals with the TensorFlow Constrained Optimization Library

Posted by Andrew Zaldivar, Responsible AI Advocate, Google Research, on behalf of the TFCO Team

Many technologies that use supervised machine learning are having an increasingly positive impact on peoples’ day-to-day lives, from catching early signs of illnesses to filtering inappropriate content. There is, however, a growing concern that learned models, which generally satisfy the narrow requirement of minimizing a single loss function, may have difficulty addressing broader societal issues such as fairness, which generally requires trading-off multiple competing considerations. Even when such factors are taken into account, these systems may still be incapable of satisfying such complex design requirements, for example that a false negative might be “worse” than a false positive, or that the model being trained should be “similar” to a pre-existing model.

The TensorFlow Constrained Optimization (TFCO) library makes it easy to configure and train machine learning problems based on multiple different metrics (e.g. the precisions on members of certain groups, the true positive rates on residents of certain countries, or the recall rates of cancer diagnoses depending on age and gender). While these metrics are simple conceptually, by offering a user the ability to minimize and constrain arbitrary combinations of them, TFCO makes it easy to formulate and solve many problems of interest to the fairness community in particular (such as equalized odds and predictive parity) and the machine learning community more generally.

How Does TFCO Relate to Our AI Principles?
The release of TFCO puts our AI Principles into action, further helping guide the ethical development and use of AI in research and in practice. By putting TFCO into the hands of developers, we aim to better equip them to identify where their models can be risky and harmful, and to set constraints that ensure their models achieve desirable outcomes.

What Are the Goals?
Borrowing an example from Hardt et al., consider the task of learning a classifier that decides whether a person should receive a loan (a positive prediction) or not (negative), based on a dataset of people who either are able to repay a loan (a positive label), or are not (negative). To set up this problem in TFCO, we would choose an objective function that rewards the model for granting loans to those people who will pay them back, and would also impose fairness constraints that prevent it from unfairly denying loans to certain protected groups of people. In TFCO, the objective to minimize, and the constraints to impose, are represented as algebraic expressions (using normal Python operators) of simple basic rates.

Instructing TFCO to minimize the overall error rate of the learned classifier for a linear model (with no fairness constraints), might yield a decision boundary that looks like this:

Illustration of a binary classification dataset with two protected groups: blue and orange. For ease of visualization, rather than plotting each individual data point, the densities are represented as ovals. The positive and negative signs denote the labels. The decision boundary drawn as a black dashed line separating positive predictions (regions above the line) and negative (regions below the line) labels, chosen to maximize accuracy.

This is a fine classifier, but in certain applications, one might consider it to be unfair. For example, positively-labeled blue examples are much more likely to receive negative predictions than positively-labeled orange examples, violating the “equal opportunity” principle. To correct this, one could add an equal opportunity constraint to the constraint list. The resulting classifier would now look something like this:

Here the decision boundary is chosen to maximize the accuracy, subject to an equal opportunity (or true positive rate) constraint.

How Do I Know What Constraints To Set?
Choosing the “right” constraints depends on the policy goals or requirements of your problem and your users. For this reason, we’ve striven to avoid forcing the user to choose from a curated list of “baked-in” problems. Instead, we’ve tried to maximize flexibility by enabling the user to define an extremely broad range of possible problems, by combining and manipulating simple basic rates.

This flexibility can have a downside: if one isn’t careful, one might attempt to impose contradictory constraints, resulting in a constrained problem with no good solutions. In the context of the above example, one could constrain the false positive rates (FPRs) to be equal, in addition to the true positive rates (TPRs) (i.e., “equalized odds”). However, the potentially contradictory nature of these two sets of constraints, coupled with our requirement for a linear model, could force us to find a solution with extremely low accuracy. For example:

Here the decision boundary is chosen to maximize the accuracy, subject to both the true positive rate and false positive rate constraints.

With an insufficiently-flexible model, either the FPRs of both groups would be equal, but very large (as in the case illustrated above), or the TPRs would be equal, but very small (not shown).

Can It Fail?
The ability to express many fairness goals as rate constraints can help drive progress in the responsible development of machine learning, but it also requires developers to carefully consider the problem they are trying to address. For example, suppose one constrains the training to give equal accuracy for four groups, but that one of those groups is much harder to classify. In this case, it could be that the only way to satisfy the constraints is by decreasing the accuracy of the three easier groups, so that they match the low accuracy of the fourth group. This probably isn’t the desired outcome.

A “safer” alternative is to constrain each group to independently satisfy some absolute metric, for example by requiring each group to achieve at least 75% accuracy. Using such absolute constraints rather than relative constraints will generally keep the groups from dragging each other down. Of course, it is possible to ask for a minimum accuracy that isn’t achievable, so some conscientiousness is still required.

The Curse of Small Sample Sizes
Another common challenge with using constrained optimization is that the groups to which constraints are applied may be under-represented in the dataset. Consequently, the stochastic gradients we compute during training will be very noisy, resulting in slow convergence. In such a scenario, we recommend that users impose the constraints on a separate rebalanced dataset that contains higher proportions from each group, and use the original dataset only to minimize the objective.

For example, in the Wiki toxicity example we provide, we wish to predict if a discussion comment posted on a Wiki talk page is toxic (i.e., contains “rude, disrespectful or unreasonable” content). Only 1.3% of the comments mention a term related to “sexuality”, and a large fraction of these comments are labelled toxic. Hence, training a CNN model without constraints on this dataset leads to the model believing that “sexuality” is a strong indicator of toxicity and results in a high false positive rate for this group. We use TFCO to constrain the false positive rate for four sensitive topics (sexuality, gender identity, religion and race) to be within 2%. To better handle the small group sizes, we use a “re-balanced” dataset to enforce the constraints and the original dataset only to minimize the objective. As shown below, the constrained model is able to significantly lower the false positive rates on the four topic groups, while maintaining almost the same accuracy as the unconstrained model.

Comparison of unconstrained and constrained CNN models for classifying toxic comments on Wiki Talk pages.

Intersectionality – The Challenge of Fine Grained Groups
Overlapping constraints can help create equitable experiences for multiple categories of historically marginalized and minority groups. Extending beyond the above example, we also provide a CelebA example that examines a computer vision model for detecting smiles in images that we wish to perform well across multiple non-mutually-exclusive protected groups. The false positive rate can be an appropriate metric here, since it measures the fraction of images not containing a smiling face that are incorrectly labeled as smiling. By comparing false positive rates based on available age group (young and old) or sex (male and female) categories, we can check for undesirable model bias (i.e., whether images of older people that are smiling are not recognized as such).

Comparison of unconstrained and constrained CNN models for classifying toxic comments on Wiki Talk pages.

Under the Hood
Correctly handling rate constraints is challenging because, being written in terms of counts (e.g., the accuracy rate is the number of correct predictions, divided by the number of examples), the constraint functions are non-differentiable. Algorithmically, TFCO converts a constrained problem into a non-zero-sum two-player game (ALT’19, JMLR’19). This framework can be extended to handle the ranking and regression settings (AAAI’20), more complex metrics such as the F-measure (NeurIPS’19a), or to improve generalization performance (ICML’19).

It is our belief that the TFCO library will be useful in training ML models that take into account the societal and cultural factors necessary to satisfy real-world requirements. Our provided examples (toxicity classification and smile detection) only scratch the surface. We hope that TFCO’s flexibility enables you to handle your problem’s unique requirements.

Acknowledgements
This work was a collaborative effort by the authors of TFCO and associated research papers, including Andrew Cotter, Maya R. Gupta, Heinrich Jiang, Harikrishna Narasimhan, Taman Narayan, Nathan Srebro, Karthik Sridharan, Serena Wang, Blake Woodworth, and Seungil You.

Source: Google AI Blog

ML-fairness-gym: A Tool for Exploring Long-Term Impacts of Machine Learning Systems

Posted by Hansa Srinivasan, Software Engineer, Google Research

Machine learning systems have been increasingly deployed to aid in high-impact decision-making, such as determining criminal sentencing, child welfare assessments, who receives medical attention and many other settings. Understanding whether such systems are fair is crucial, and requires an understanding of models’ short- and long-term effects. Common methods for assessing the fairness of machine learning systems involve evaluating disparities in error metrics on static datasets for various inputs to the system. Indeed, many existing ML fairness toolkits (e.g., AIF360, fairlearn, fairness-indicators, fairness-comparison) provide tools for performing such error-metric based analysis on existing datasets. While this sort of analysis may work for systems in simple environments, there are cases (e.g., systems with active data collection or significant feedback loops) where the context in which the algorithm operates is critical for understanding its impact. In these cases, the fairness of algorithmic decisions ideally would be analyzed with greater consideration for the environmental and temporal context than error metric-based techniques allow.

In order to facilitate algorithmic development with this broader context, we have released ML-fairness-gym, a set of components for building simple simulations that explore potential long-run impacts of deploying machine learning-based decision systems in social environments. In “Fairness is not Static: Deeper Understanding of Long Term Fairness via Simulation Studies” we demonstrate how the ML-fairness-gym can be used to research the long-term effects of automated decision systems on a number of established problems from current machine learning fairness literature.

An Example: The Lending Problem
A classic problem for considering fairness in machine learning systems is the lending problem, as described by Liu et al. This problem is a highly simplified and stylized representation of the lending process, where we focus on a single feedback loop in order to isolate its effects and study it in detail. In this problem formulation, the probability that individual applicants will pay back a loan is a function of their credit score. These applicants also belong to one of an arbitrary number of groups, with their group membership observable by the lending bank.

The groups start with different credit score distributions. The bank is trying to determine a threshold on the credit scores, applied across groups or tailored to each, that best enables the bank to reach its objectives. Applicants with scores higher than the threshold receive loans, and those with lower scores are rejected. When the simulation selects an individual, whether or not they will pay the loan is randomly determined based on their group’s probability of payback. In this example, individuals currently applying for loans may apply for additional loans in the future and thus, by paying back their loan, both their credit score and their group’s average credit score increases. Similarly, if the applicant defaults, the group’s average credit score decreases.

The most effective threshold settings will depend on the bank’s goals. A profit-maximizing bank may set a threshold that maximizes the predicted return, based on the estimated likelihood that applicants will repay their loans. Another bank, seeking to be fair to both groups, may try to implement thresholds that maximize profit while satisfying equality of opportunity, the goal of which is to have equal true positive rates (TPR is also called recall or sensitivity; a measure of what fraction of applicants who would have paid back loans were given a loan). In this scenario, machine learning techniques are employed by the bank to determine the most effective threshold based on loans that have been distributed and their outcomes. However, since these techniques are often focused on short-term objectives, they may have unintended and unfair consequences for different groups.

Top:Changing credit score distributions for the two groups over 100 steps of simulation. Bottom: (left) The bank cash and (right) the TPR for group 1 in blue and group 2 in green over the course of the simulation.

Deficiencies in Static Dataset Analysis
A standard practice in machine learning to assess the impact of a scenario like the lending problem is to reserve a portion of the data as a “test set”, and use that to calculate relevant performance metrics. Fairness is then assessed by looking at how those performance metrics differ across salient groups. However, it is well understood that there are two main issues with using test sets like this in systems with feedback. If test sets are generated from existing systems, they may be incomplete or reflect the biases inherent to those systems. In the lending example, a test set could be incomplete because it may only have information on whether an applicant who has been given a loan has defaulted or repaid. Consequently, the dataset may not include individuals for whom loans have not been approved or who have not had access to loans before.

The second issue is that actions informed by the output of the ML system can have effects that may influence their future input. The thresholds determined by the ML system are used to extend loans. Whether people default or repay these loans then affects their future credit score, which then feed back into the ML system.

These issues highlight the shortcomings of assessing fairness in static datasets and motivate the need for analyzing the fairness of algorithms in the context of the dynamic systems in which they are deployed. We created the ML-fairness-gym framework to help ML practitioners bring simulation-based analysis to their ML systems, an approach that has proven effective in many fields for analyzing dynamic systems where closed form analysis is difficult.

ML-fairness-gym as a Simulation Tool for Long-Term Analysis
The ML-fairness-gym simulates sequential decision making using Open AI’s Gym framework. In this framework, agents interact with simulated environments in a loop. At each step, an agent chooses an action that then affects the environment’s state. The environment then reveals an observation that the agent uses to inform its subsequent actions. In this framework, environments model the system and dynamics of the problem and observations serve as data to the agent, which can be encoded as a machine learning system.

Flow chart schematic of the agent-environment interaction loop used in the simulation framework. Agents affect environments via a choice of action. Environments change in response to the action and yield parts of their internal state as an observation. Metrics examine the history of the environment to evaluate outcomes.

In the lending example, the bank acts as the agent. It receives loan applicants, their credit scores and their group membership in the form of observations from the environment, and takes actions in the form of a binary decision to either accept or reject for a loan. The environment then models whether the applicant successfully repays or defaults, and adjusts their credit score accordingly. The ML-fairness-gym simulates the outcomes so that the long-term effects of the bank’s policies on fairness to the applicant population can be assessed.

Fairness Is Not Static: Extending the Analysis to the Long-Term
Since Liu et al.’s original formulation of the lending problem examined only the short-term consequences of the bank’s policies — including short-term profit-maximizing policies (called the max reward agent) and policies subject to an equality of opportunity (EO) constraint — we use the ML-fairness-gym to extend the analysis to the long-term (many steps) via simulation.

Top: Cumulative loans granted by the max reward and EO agents, stratified by the group identity of the applicant. Bottom: Group average credit (quantified by group-conditional probability of repayment) as the simulation progresses. The EO agent increases access to loans for group 2, but also widens the credit gap between the groups.

Our long-term analysis found two results. First, as found by Liu et al., the equal opportunity agent (EO agent) overlends to the disadvantaged group (group 2, which initially has a lower average credit score) by sometimes applying a lower threshold for the group than would be applied by the max reward agent. This causes the credit scores of group 2 to decrease more than group 1, resulting in a wider credit score gap between the groups than in the simulations with the max reward agent. However, our analysis also found that while group 2 may seem worse off with the EO agent, from looking at the Cumulative loans graph, we see that the disadvantaged group 2 receives significantly more loans from the EO agent. Depending on whether the indicator of welfare is the credit score or total loans received, it could be argued that the EO agent is better or more detrimental to group 2 than the max reward agent.

The second finding is that equal opportunity constraints — enforcing equalized TPR between groups at each step — does not equalize TPR in aggregate over the simulation. This perhaps counterintuitive result can be thought of as an instance of Simpson’s paradox. As seen in the chart below, equal TPR in each of two years does not imply equal TPR in aggregate. This demonstrates how the equality of opportunity metric is difficult to interpret when the underlying population is evolving, and suggests that more careful analysis is necessary to ensure that the ML system is having the desired effects.

An example of Simpson's paradox. TP are the true positive classifications, FN corresponds to the false negative classifications and TPR is the true positive rate. In years 1 and 2, the lender applies a policy that achieves equal TPR between the two groups. The aggregation over both years does not have equal TPR.

Conclusion and Future Work
While we focused on our findings for the lending problem in this blog post, the ML-fairness-gym can be used to tackle a wide variety of fairness problems. Our paper extends the analysis of two other scenarios that have been previously studied in the academic ML fairness literature. The ML-fairness-gym framework is also flexible enough to simulate and explore problems where “fairness” is under-explored. For example, in a supporting paper, “Fair treatment allocations in social networks,” we explore a stylized version of epidemic control, which we call the precision disease control problem, to better understand notions of fairness across individuals and communities in a social network.

We’re excited about the potential of the ML-fairness-gym to help other researchers and machine learning developers better understand the effects that machine learning algorithms have on our society, and to inform the development of more responsible and fair machine learning systems. Find the code and papers in the ML-fairness-gym Github repository.

Source: Google AI Blog

Fairness Indicators: Scalable Infrastructure for Fair ML Systems

Posted by Catherina Xu and Tulsee Doshi, Product Managers, Google Research

While industry and academia continue to explore the benefits of using machine learning (ML) to make better products and tackle important problems, algorithms and the datasets on which they are trained also have the ability to reflect or reinforce unfair biases. For example, consistently flagging non-toxic text comments from certain groups as “spam” or “high toxicity” in a moderation system leads to exclusion of those groups from conversation.

In 2018, we shared how Google uses AI to make products more useful, highlighting AI principles that will guide our work moving forward. The second principle, “Avoid creating or reinforcing unfair bias,” outlines our commitment to reduce unjust biases and minimize their impacts on people.

As part of this commitment, at TensorFlow World, we recently released a beta version of Fairness Indicators, a suite of tools that enable regular computation and visualization of fairness metrics for binary and multi-class classification, helping teams take a first step towards identifying unjust impacts. Fairness Indicators can be used to generate metrics for transparency reporting, such as those used for model cards, to help developers make better decisions about how to deploy models responsibly. Because fairness concerns and evaluations differ case by case, we also include in this release an interactive case study with Jigsaw’s Unintended Bias in Toxicity dataset to illustrate how Fairness Indicators can be used to detect and remediate bias in a production machine learning (ML) model, depending on the context in which it is deployed. Fairness Indicators is now available in beta for you to try for your own use cases.

What is ML Fairness?
Bias can manifest in any part of a typical machine learning pipeline, from an unrepresentative dataset, to learned model representations, to the way in which the results are presented to the user. Errors that result from this bias can disproportionately impact some users more than others.

To detect this unequal impact, evaluation over individual slices, or groups of users, is crucial as overall metrics can obscure poor performance for certain groups. These groups may include, but are not limited to, those defined by sensitive characteristics such as race, ethnicity, gender, nationality, income, sexual orientation, ability, and religious belief. However, it is also important to keep in mind that fairness cannot be achieved solely through metrics and measurement; high performance, even across slices, does not necessarily prove that a system is fair. Rather, evaluation should be viewed as one of the first ways, especially for classification models, to identify gaps in performance.

The Fairness Indicators Suite of Tools
The Fairness Indicators tool suite enables computation and visualization of commonly-identified fairness metrics for classification models, such as false positive rate and false negative rate, making it easy to compare performance across slices or to a baseline slice. The tool computes confidence intervals, which can surface statistically significant disparities, and performs evaluation over multiple thresholds. In the UI, it is possible to toggle the baseline slice and investigate the performance of various other metrics. The user can also add their own metrics for visualization, specific to their use case.

Furthermore, Fairness Indicators is integrated with the What-If Tool (WIT) — clicking on a bar in the Fairness Indicators graph will load those specific data points into the the WIT widget for further inspection, comparison, and counterfactual analysis. This is particularly useful for large datasets, where Fairness Indicators can be used to identify problematic slices before the WIT is used for a deeper analysis.

Using Fairness Indicators to visualize metrics for fairness evaluation.

Clicking on a slice in Fairness Indicators will load all the data points in that slice inside the What-If Tool widget. In this case, all data points with the “female” label are shown.

The Fairness Indicators beta launch includes the following:

pip package: Includes Tensorflow Model Analysis (TFMA), Fairness Indicators, Tensorflow Data Validation (TFDV), What-If Tool, and example Colabs:
- Fairness Indicators Example Colab — an introduction to Fairness Indicators usage
- Fairness Indicators for TensorBoard — a TensorBoard plug-in usage example
- Fairness Indicators with TFHub Embeddings — a Colab that investigates the effects of different embeddings on downstream fairness metrics
- Fairness Indicators with Cloud Vision API's Face Detection Model — a Colab showing how Fairness Indicators can be used to generate evaluation results for model cards
GitHub repository: Source code
Guidance for usage: Fairness is highly contextual, and it’s important to carefully think through each use case and potential implications for users. This document provides guidance for selecting groups and metrics, and highlights evaluation best practices.
Case Study: Interactive case study on using Fairness Indicators, showing how Jigsaw's Conversation AI team detects bias in a classification model using the Toxic Comment Classification dataset.

How To Use Fairness Indicators in Models Today
Fairness Indicators is built on top of TensorFlow Model Analysis, a component of TensorFlow Extended (TFX) that can be used to investigate and visualize model performance. Based on the specific ML workflow, Fairness Indicators can be incorporated into a system in one of the following ways:
If using TensorFlow models and tools, such as TFX:

Access Fairness Indicators as part of the Evaluator component in TFX
Access Fairness Indicators in TensorBoard when evaluating other real-time metrics

If not using existing TensorFlow tools:

Download the Fairness Indicators pip package, and use Tensorflow Model Analysis as a standalone tool

For non-TensorFlow models:

Use Model Agnostic TFMA to compute Fairness Indicators based on the output of any model

Fairness Indicators Case Study
We created a case study and introductory video that illustrates how Fairness Indicators can be used with a combination of tools to detect and mitigate bias in a model trained on Jigsaw’s Unintended Bias in Toxicity dataset. The dataset was developed by Conversation AI, a team within Jigsaw that works to train ML models to protect voices in conversation. Models are trained to predict whether text comments are likely to be abusive along a variety of dimensions including toxicity, insult, and sexual explicitness.

The primary use case for models such as these is content moderation. If a model penalizes certain types of messages in a systematic way (e.g., often marks comments as toxic when they are not, leading to a high false positive rate), those voices will be silenced. In the case study, we investigated false positive rate on subgroups sliced by gender identity keywords that are present in the dataset, using a combination of tools (Fairness Indicators, TFDV, and WIT) to detect, diagnose, and take steps toward remediating the underlying problem.

What’s next?
Fairness Indicators is only the first step. We plan to expand vertically by enabling more supported metrics, such as metrics that enable you to evaluate classifiers without thresholds, and horizontally by creating remediation libraries that utilize methods, such as active learning and min-diff. Because we believe it is important to learn through real examples, we hope to ground our work in more case studies to be released over the next few months, as more features become available.

To get started, see the Fairness Indicators GitHub repo. For more information on how to think about fairness evaluation in the context of your use case, see this link.

We would love to partner with you to understand where Fairness Indicators is most useful, and where added functionality would be valuable. Please reach out at [email protected] to provide any feedback on your experience!

Acknowledgements
The core team behind this work includes Christina Greer, Manasi Joshi, Huanming Fang, Shivam Jindal, Karan Shukla, Osman Aka, Sanders Kleinfeld, Alicia Chang, Alex Hanna, and Dan Nanas. We would also like to thank James Wexler, Mahima Pushkarna, Meg Mitchell and Ben Hutchinson for their contributions to the project.

Source: Google AI Blog

Parrotron: New Research into Improving Verbal Communication for People with Speech Impairments

Posted by Fadi Biadsy, Research Scientist and Ron Weiss, Software Engineer, Google Research

Most people take for granted that when they speak, they will be heard and understood. But for the millions who live with speech impairments caused by physical or neurological conditions, trying to communicate with others can be difficult and lead to frustration. While there have been a great number of recent advances in automatic speech recognition (ASR; a.k.a. speech-to-text) technologies, these interfaces can be inaccessible for those with speech impairments. Further, applications that rely on speech recognition as input for text-to-speech synthesis (TTS) can exhibit word substitution, deletion, and insertion errors. Critically, in today’s technological environment, limited access to speech interfaces, such as digital assistants that depend on directly understanding one's speech, means being excluded from state-of-the-art tools and experiences, widening the gap between what those with and without speech impairments can access.

Project Euphonia has demonstrated that speech recognition models can be significantly improved to better transcribe a variety of atypical and dysarthric speech. Today, we are presenting Parrotron, an ongoing research project that continues and extends our effort to build speech technologies to help those with impaired or atypical speech to be understood by both people and devices. Parrotron consists of a single end-to-end deep neural network trained to convert speech from a speaker with atypical speech patterns directly into fluent synthesized speech, without an intermediate step of generating text—skipping speech recognition altogether. Parrotron’s approach is speech-centric, looking at the problem only from the point of view of speech signals—e.g., without visual cues such as lip movements. Through this work, we show that Parrotron can help people with a variety of atypical speech patterns—including those with ALS, deafness, and muscular dystrophy—to be better understood in both human-to-human interactions and by ASR engines.

The Parrotron Speech Conversion Model
Parrotron is an attention-based sequence-to-sequence model trained in two phases using parallel corpora of input/output speech pairs. First, we build a general speech-to-speech conversion model for standard fluent speech, followed by a personalization phase that adjusts the model parameters to the atypical speech patterns from the target speaker. The primary challenge in such a configuration lies in the collection of the parallel training data needed for supervised training, which consists of utterances spoken by many speakers and mapped to the same output speech content spoken by a single speaker. Since it is impractical to have a single speaker record the many hours of training data needed to build a high quality model, Parrotron uses parallel data automatically derived with a TTS system. This allows us to make use of a pre-existing anonymized, transcribed speech recognition corpus to obtain training targets.

The first training phase uses a corpus of ~30,000 hours that consists of millions of anonymized utterance pairs. Each pair includes a natural utterance paired with an automatically synthesized speech utterance that results from running our state-of-the-art Parallel WaveNet TTS system on the transcript of the first. This dataset includes utterances from thousands of speakers spanning hundreds of dialects/accents and acoustic conditions, allowing us to model a large variety of voices, linguistic and non-linguistic contents, accents, and noise conditions with “typical” speech all in the same language. The resulting conversion model projects away all non-linguistic information, including speaker characteristics, and retains only what is being said, not who, where, or how it is said. This base model is used to seed the second personalization phase of training.

The second training phase utilizes a corpus of utterance pairs generated in the same manner as the first dataset. In this case, however, the corpus is used to adapt the network to the acoustic/phonetic, phonotactic and language patterns specific to the input speaker, which might include, for example, learning how the target speaker alters, substitutes, and reduces or removes certain vowels or consonants. To model ALS speech characteristics in general, we use utterances taken from an ALS speech corpus derived from Project Euphonia. If instead we want to personalize the model for a particular speaker, then the utterances are contributed by that person. The larger this corpus is, the better the model is likely to be at correctly converting to fluent speech. Using this second smaller and personalized parallel corpus, we run the neural-training algorithm, updating the parameters of the pre-trained base model to generate the final personalized model.

We found that training the model with a multitask objective to predict the target phonemes while simultaneously generating spectrograms of the target speech led to significant quality improvements. Such a multitask trained encoder can be thought of as learning a latent representation of the input that maintains information about the underlying linguistic content.

Overview of the Parrotron model architecture. An input speech spectrogram is passed through encoder and decoder neural networks to generate an output spectrogram in a new voice.

Case Studies
To demonstrate a proof of concept, we worked with our fellow Google research scientist and mathematician Dimitri Kanevsky, who was born in Russia to Russian speaking, normal-hearing parents but has been profoundly deaf from a very young age. He learned to speak English as a teenager, by using Russian phonetic representations of English words, learning to pronounce English using transliteration into Russian (e.g., The quick brown fox jumps over the lazy dog => ЗИ КВИК БРАУН ДОГ ЖАМПС ОУВЕР ЛАЙЗИ ДОГ). As a result, Dimitri’s speech is substantially distinct from native English speakers, and can be challenging to comprehend for systems or listeners who are not accustomed to it.

Dimitri recorded a corpus of 15 hours of speech, which was used to adapt the base model to the nuances specific to his speech. The resulting Parrotron system helped him be better understood by both people and Google’s ASR system alike. Running Google’s ASR engine on the output of Parrotron significantly reduced the word error rate from 89% to 32%, on a held out test set from Dimitri. Below is an example of Parrotron’s successful conversion of input speech from Dimitri:

Input from Dimitri	Audio
Output from Parrotron	Audio

We also worked with Aubrie Lee, a Googler and advocate for disability inclusion, who has muscular dystrophy, a condition that causes progressive muscle weakness, and sometimes impacts speech production. Aubrie contributed 1.5 hours of speech, which has been instrumental in showing promising outcomes of the applicability of this speech-to-speech technology. Below is an example of Parrotron’s successful conversion of input speech from Aubrie:

Input from Aubrie	Audio
Output from Parrotron	Audio




Input from Aubrie	Audio
Output from Parrotron	Audio

We also tested Parrotron’s performance on speech from speakers with ALS by adapting the pretrained model on multiple speakers who share similar speech characteristics grouped together, rather than on a single speaker. We conducted a preliminary listening study and observed an increase in intelligibility when comparing natural ALS speech to the corresponding speech obtained from running the Parroton model, for the majority of our test speakers.

Cascaded Approach
Project Euphonia has built a personalized speech-to-text model that has reduced the word error rate for a deaf speaker from 89% to 25%, and ongoing research is also likely to improve upon these results. One could use such a speech-to-text model to achieve a similar goal as Parrotron by simply passing its output into a TTS system to synthesize speech from the result. In such a cascaded approach, however, the recognizer may choose an incorrect word (roughly 1 out 4 times, in this case)—i.e., it may yield words/sentences with unintended meaning and, as a result, the synthesized audio of these words would be far from the speaker’s intention. Given the end-to-end speech-to-speech training objective function of Parrotron, even when errors are made, the generated output speech is likely to sound acoustically similar to the input speech, and thus the speaker’s original intention is less likely to be significantly altered and it is often still possible to understand what is intended:

Input from Dimitri	Audio
Output from Parrotron	Audio




Input from Dimitri	Audio
Output from Parrotron/Input to Assistant	Audio
Output from Assistant	Audio




Input from Aubrie	Audio
Output from Parrotron	Audio

Furthermore, since Parrotron is not strongly biased to producing words from a predefined vocabulary set, input to the model may contain completely new invented words, foreign words/names, and even nonsense words. We observe that feeding Arabic and Spanish utterances into the US-English Parrotron model often results in output which echoes the original speech content with an American accent, in the target voice. Such behavior is qualitatively different from what one would obtain by simply running an ASR followed by a TTS. Finally, by going from a combination of independently tuned neural networks to a single one, we also believe there are improvements and simplifications that could be substantial.

Conclusion
Parrotron makes it easier for users with atypical speech to talk to and be understood by other people and by speech interfaces, with its end-to-end speech conversion approach more likely to reproduce the user’s intended speech. More exciting applications of Parrotron are discussed in our paper and additional audio samples can be found on our github repository. If you would like to participate in this ongoing research, please fill out this short form and volunteer to record a set of phrases. We look forward to working with you!

Acknowledgements
This project was joint work between the Speech and Google Brain teams. Contributors include Fadi Biadsy, Ron Weiss, Pedro Moreno, Dimitri Kanevsky, Ye Jia, Suzan Schwartz, Landis Baker, Zelin Wu, Johan Schalkwyk, Yonghui Wu, Zhifeng Chen, Patrick Nguyen, Aubrie Lee, Andrew Rosenberg, Bhuvana Ramabhadran, Jason Pelecanos, Julie Cattiau, Michael Brenner, Dotan Emanuel and Joel Shor. Our data collection efforts have been vastly accelerated by our collaborations with ALS-TDI.

Source: Google AI Blog

Providing Gender-Specific Translations in Google Translate

Posted by Melvin Johnson, Senior Software Engineer, Google Translate

Over the past few years, Google Translate has made significant improvements to translation quality by switching to an end-to-end neural network-based system. At the same time, we realized that translations from our models can reflect societal biases, such as gender bias. Specifically, languages differ a lot in how they represent gender, and when there are ambiguities during translation, the systems tend to pick gender choices that reflect societal asymmetries, resulting in biased translations. For instance, Google Translate historically translated the Turkish equivalent of “He/she is a doctor” into the masculine form, and the Turkish equivalent of “He/she is a nurse” into the feminine form.

Recently, we announced that we’re taking the first step at reducing gender bias in our translations. We now provide both feminine and masculine translations when translating single-word queries from English to four different languages (French, Italian, Portuguese, and Spanish), and when translating phrases and sentences from Turkish to English.

Gender-specific translations on the Google Translate website.

Supporting gender-specific translations for single-word queries involved enriching our underlying dictionary with gender attributes. Supporting gender-specific translations for longer queries (phrases and sentences) was particularly challenging and involved making significant changes to our translation framework. For these longer queries, we focused initially on Turkish-to-English translation. We developed a three-step approach to solve the problem of providing a masculine and feminine translation in English for a gender-neutral query in Turkish.

Detecting Gender-Neutral Queries
Many Turkish sentences that refer to people are gender-neutral, but not all are. Detecting which queries are eligible for gender-specific translations is a hard problem because Turkish is morphologically complex, meaning that reference to a person can either be explicit with a gender-neutral pronoun (e.g. O, Ona) or implicitly encoded. For example, the sentence “Biliyor mu?” has no explicit gender-neutral pronoun but can be translated as either “Does she know?” or “Does he know?”. This complexity means that we cannot use a simple list of gender-neutral pronouns to detect gender-neutral Turkish queries and need a machine-learned system. We estimate that approximately 10% of Turkish Translate queries are ambiguous, and eligible for both feminine and masculine translations.

To detect these queries, we use state-of-the-art text classification algorithms (same as those used in our Cloud Natural Language API) to build a system that is able to detect when a given Turkish query is gender-neutral. Since this introduces an additional step before obtaining the translations, we had to carefully balance model complexity with latency. We trained our system on thousands of human-rated Turkish examples, where raters were asked to judge whether a given example is gender-neutral or not. Our final classification system is a convolutional neural network that can accurately detect queries which require gender-specific translations.

Generating Gender-Specific Translations
Next, we enhanced our underlying Neural Machine Translation (NMT) system to produce feminine and masculine translations when requested. When no gender is requested, we trained the model to produce the default translation. This involved:

Identifying and dividing our parallel training data into those with feminine words, those with masculine and those with ungendered words.
Adding an additional input token to the beginning of the sentence to specify the required gender to translate to, similar to how we build multilingual NMT systems:

<2MALE> O bir doktor → He is a doctor
<2FEMALE> O bir doktor → She is a doctor

Training our enhanced NMT model on the feminine, masculine and ungendered data sources. We experimented with various mixing ratios for these sources to enable the model to perform equally well on the three tasks.

If a user's query is determined to be gender-neutral, we add a gender prefix to the translation request. For these requests, our final NMT model can reliably produce feminine and masculine translations 99% of the time. Additionally, the system maintains translation quality on queries without the gender prefix.

Checking for Accuracy
Finally, we have a step that decides whether to display the gender-specific translations. Since the training data that produces the masculine translation is different from the training data that produces the feminine translation, there may be differences between the two translations unrelated to gender. If the gender-specific translations are determined to be low quality, we show only the single default translation. To determine the quality of the gender-specific translations, we verify:

If the requested feminine translation is feminine.
If the requested masculine translation is masculine.
If the feminine and masculine translations are exactly equivalent with the exception of gender-related changes. Even minor changes in the wording between the translations will result in being filtered.

Top: The masculine and feminine translations differ only with respect to gender i.e. “he” and “his” vs “she” and “her”. Hence, we will show gender-specific translations. Bottom: The masculine and feminine translations differ correctly with respect to gender i.e. “he” vs “she”. However, the change from “really” to “actually” is not related to gender. Hence, we will filter gender-specific translations and display the default translation.

Putting it all together, input sentences first go through the classifier, which detects whether they’re eligible for gender-specific translations. If the classifier says “yes”, we send three requests to our enhanced NMT model—a feminine request, a masculine request and an ungendered request. Our final step takes into account all three responses and decides whether to display gender-specific translations or a single default translation. This step is still quite conservative in order to maximize the quality of gender-specific translations shown; hence our overall recall is only around 60%. We plan to increase our coverage and add support for more complex sentences in future iterations.

This is just the first step toward addressing gender bias in machine-translation systems and reiterates Google’s commitment to fairness in machine learning. In the future, we plan to extend gender-specific translations to more languages and to address non-binary gender in translations.

Acknowledgements:
This effort has been successful thanks to the hard work of a lot of people including, but not limited to, the following (in alphabetical order of last name): Lindsey Boran, HyunJeong Choe, Héctor Fernández Alcalde, Orhan Firat, Qin Gao, Rick Genter, Macduff Hughes, Tolga Kayadelen, James Kuczmarski, Tatiana Lando, Liu Liu, Michael Mandl, Nihal Meriç Atilla, Mengmeng Niu, Adnan Ozturel, Emily Pitler, Kathy Ray, John Richardson, Larissa Rinaldi, Alex Rudnick, Apu Shah, Jason Smith, Antonio Stella, Romina Stella, Jana Strnadova, Katrin Tomanek, Barak Turovsky, Dan Schwarz, Shilp Vaishnav, Clayton Watts, Kellie Webster, Colin Young, Pendar Yousefi, Candice Zhang and Min Zhao.

googblogs.com

All Google blogs and Press in one site

Tag Archives: ML Fairness

Introducing the Model Card Toolkit for Easier Model Transparency Reporting

Source: Google AI Blog

Setting Fairness Goals with the TensorFlow Constrained Optimization Library

Source: Google AI Blog

ML-fairness-gym: A Tool for Exploring Long-Term Impacts of Machine Learning Systems

Source: Google AI Blog

Fairness Indicators: Scalable Infrastructure for Fair ML Systems

Source: Google AI Blog

Parrotron: New Research into Improving Verbal Communication for People with Speech Impairments

Source: Google AI Blog

Providing Gender-Specific Translations in Google Translate

Source: Google AI Blog