RO-ViT: Region-aware pre-training for open-vocabulary object detection with vision transformers

The ability to detect objects in the visual world is crucial for computer vision and machine intelligence, enabling applications like adaptive autonomous agents and versatile shopping systems. However, modern object detectors are limited by the manual annotations of their training data, resulting in a vocabulary size significantly smaller than the vast array of objects encountered in reality. To overcome this, the open-vocabulary detection task (OVD) has emerged, utilizing image-text pairs for training and incorporating new category names at test time by associating them with the image content. By treating categories as text embeddings, open-vocabulary detectors can predict a wide range of unseen objects. Various techniques such as image-text pre-training, knowledge distillation, pseudo labeling, and frozen models, often employing convolutional neural network (CNN) backbones, have been proposed. With the growing popularity of vision transformers (ViTs), it is important to explore their potential for building proficient open-vocabulary detectors.

The existing approaches assume the availability of pre-trained vision-language models (VLMs) and focus on fine-tuning or distillation from these models to address the disparity between image-level pre-training and object-level fine-tuning. However, as VLMs are primarily designed for image-level tasks like classification and retrieval, they do not fully leverage the concept of objects or regions during the pre-training phase. Thus, it could be beneficial for open-vocabulary detection if we build locality information into the image-text pre-training.

In “RO-ViT: Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers”, presented at CVPR 2023, we introduce a simple method to pre-train vision transformers in a region-aware manner to improve open-vocabulary detection. In vision transformers, positional embeddings are added to image patches to encode information about the spatial position of each patch within the image. Standard pre-training typically uses full-image positional embeddings, which does not generalize well to detection tasks. Thus, we propose a new positional embedding scheme, called “cropped positional embedding”, that better aligns with the use of region crops in detection fine-tuning. In addition, we replace the softmax cross entropy loss with focal loss in contrastive image-text learning, allowing us to learn from more challenging and informative examples. Finally, we leverage recent advances in novel object proposals to enhance open-vocabulary detection fine-tuning, which is motivated by the observation that existing methods often miss novel objects during the proposal stage due to overfitting to foreground categories. We are also releasing the code here.


Region-aware image-text pre-training

Existing VLMs are trained to match an image as a whole to a text description. However, we observe a mismatch between the way positional embeddings are used in existing contrastive pre-training approaches and in open-vocabulary detection. Positional embeddings are important to transformers because they encode where each element in the set comes from, information that is often useful for downstream recognition and localization tasks. Pre-training approaches typically apply full-image positional embeddings during training and reuse them for downstream tasks, e.g., zero-shot recognition. However, in open-vocabulary detection fine-tuning, recognition occurs at the region level, which requires the full-image positional embeddings to generalize to regions they never saw during pre-training.

To address this, we propose cropped positional embeddings (CPE). With CPE, we upsample the positional embeddings from the image size typical for pre-training, e.g., 224x224 pixels, to that typical for detection tasks, e.g., 1024x1024 pixels. Then we randomly crop and resize a region and use it as the image-level positional embeddings during pre-training. The position, scale, and aspect ratio of the crop are randomly sampled. Intuitively, this causes the model to view an image not as a full image in itself, but as a region crop from some larger unknown image. This better matches the downstream use case of detection, where recognition occurs at the region rather than the image level.

For the pre-training, we propose cropped positional embedding (CPE), which randomly crops and resizes a region of the positional embeddings instead of using the whole-image positional embedding (PE). In addition, we use focal loss instead of the common softmax cross entropy loss for contrastive learning.
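To make the mechanism concrete, here is a minimal PyTorch sketch of CPE. This is an illustrative reimplementation rather than the released RO-ViT code, and the grid sizes and crop-sampling ranges are assumptions.

```python
# Illustrative sketch of cropped positional embeddings (CPE); the grid sizes and
# crop-sampling ranges are assumptions, not the released RO-ViT settings.
import torch
import torch.nn.functional as F

def cropped_positional_embedding(pos_embed, pretrain_grid=14, det_grid=64):
    """pos_embed: [1, pretrain_grid**2, dim] learned positional embeddings.

    Upsamples the pre-training PE grid (e.g., 224/16 = 14) to the detection-scale
    grid (e.g., 1024/16 = 64), samples a random crop, and resizes it back so the
    rest of the ViT sees a PE tensor of the usual shape.
    """
    dim = pos_embed.shape[-1]
    pe = pos_embed.reshape(1, pretrain_grid, pretrain_grid, dim).permute(0, 3, 1, 2)
    pe = F.interpolate(pe, size=(det_grid, det_grid), mode="bilinear", align_corners=False)

    # Randomly sample the crop's scale, aspect ratio, and position.
    scale = torch.empty(1).uniform_(0.1, 1.0).item()
    ratio = torch.empty(1).uniform_(0.5, 2.0).item()
    h = min(det_grid, max(1, int(round(det_grid * (scale / ratio) ** 0.5))))
    w = min(det_grid, max(1, int(round(det_grid * (scale * ratio) ** 0.5))))
    top = int(torch.randint(0, det_grid - h + 1, (1,)).item())
    left = int(torch.randint(0, det_grid - w + 1, (1,)).item())

    crop = pe[:, :, top:top + h, left:left + w]
    crop = F.interpolate(crop, size=(pretrain_grid, pretrain_grid),
                         mode="bilinear", align_corners=False)
    return crop.permute(0, 2, 3, 1).reshape(1, pretrain_grid * pretrain_grid, dim)
```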

We also find it beneficial to learn from hard examples with a focal loss. Focal loss gives finer control over how hard examples are weighted than the softmax cross entropy loss can provide. We therefore replace the softmax cross entropy loss with focal loss in both the image-to-text and text-to-image contrastive losses. Both CPE and focal loss introduce no extra parameters and only minimal computation cost.
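Below is a minimal sketch of one way to instantiate such a focal contrastive loss, using a sigmoid-based binary focal loss over the image-text similarity matrix; the exact formulation and the temperature, alpha, and gamma values in RO-ViT may differ.

```python
# One way to instantiate a focal contrastive loss: a sigmoid-based binary focal
# loss over the image-text similarity matrix. The formulation and the
# temperature/alpha/gamma values here are illustrative assumptions.
import torch
import torch.nn.functional as F

def focal_contrastive_loss(image_emb, text_emb, temperature=0.01, alpha=0.5, gamma=2.0):
    """image_emb, text_emb: L2-normalized [batch, dim] embeddings of paired data."""
    logits = image_emb @ text_emb.t() / temperature             # [B, B] similarities
    targets = torch.eye(logits.shape[0], device=logits.device)  # matched pairs on the diagonal
    p = torch.sigmoid(logits)
    p_t = p * targets + (1.0 - p) * (1.0 - targets)             # prob. of the correct decision
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # (1 - p_t)^gamma down-weights easy pairs so hard, informative ones dominate.
    loss = alpha_t * (1.0 - p_t) ** gamma * ce
    # Averaging over rows covers image-to-text; over columns, text-to-image.
    return loss.mean()
```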


Open-vocabulary detector fine-tuning

An open-vocabulary detector is trained with the detection labels of ‘base’ categories, but needs to detect the union of ‘base’ and ‘novel’ (unlabeled) categories at test time. Although the backbone features are pre-trained on vast open-vocabulary data, the added detector layers (neck and heads) are newly trained on the downstream detection dataset. Existing approaches often miss novel/unlabeled objects in the object proposal stage because the proposals tend to classify them as background. To remedy this, we leverage recent advances in novel object proposal methods and adopt a localization-quality-based objectness score (i.e., centerness) in place of the object-or-not binary classification score, which is then combined with the detection score. During training, we compute the detection score for each detected region as the cosine similarity between the region’s embedding (computed via an RoI-Align operation) and the text embeddings of the base categories. At test time, we append the text embeddings of the novel categories, and the detection score is computed over the union of the base and novel categories.

The pre-trained ViT backbone is transferred to downstream open-vocabulary detection by replacing the global average pooling with detector heads. The RoI-Align embeddings are matched with the cached category embeddings to obtain the VLM score, which is combined with the detection score to produce the open-vocabulary detection score.
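A minimal sketch of this scoring step is shown below; the cosine-similarity matching follows the description above, while the geometric-mean combination and its weights are illustrative assumptions rather than the paper's tuned values.

```python
# Minimal sketch of open-vocabulary scoring: cosine similarity between RoI
# embeddings and cached category text embeddings, blended with the detector's
# own score. The geometric-mean weights are illustrative assumptions.
import torch
import torch.nn.functional as F

def vlm_scores(roi_embeddings, text_embeddings, temperature=0.01):
    """roi_embeddings: [num_rois, dim] pooled from the ViT feature map via RoI-Align.
    text_embeddings: [num_categories, dim] cached text-encoder outputs
    (base categories during training; base + novel categories at test time)."""
    rois = F.normalize(roi_embeddings, dim=-1)
    text = F.normalize(text_embeddings, dim=-1)
    return (rois @ text.t() / temperature).softmax(dim=-1)  # [num_rois, num_categories]

def combined_scores(vlm_probs, det_probs, is_base, w_base=0.35, w_novel=0.65):
    """Blend detection and VLM scores per category, geometric-mean style."""
    w = torch.where(is_base,
                    torch.tensor(w_base, device=det_probs.device),
                    torch.tensor(w_novel, device=det_probs.device))
    return det_probs ** (1.0 - w) * vlm_probs ** w
```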

Results

We evaluate RO-ViT on the LVIS open-vocabulary detection benchmark. At the system level, our best model achieves 33.6 box average precision on rare categories (APr) and 32.1 mask APr, outperforming the best existing ViT-based approach, OWL-ViT, by 8.0 APr and the best CNN-based approach, ViLD-Ens, by 5.8 mask APr. It also exceeds the performance of many other approaches based on knowledge distillation, pre-training, or joint training with weak supervision.

RO-ViT outperforms both the state-of-the-art (SOTA) ViT-based and CNN-based methods on the LVIS open-vocabulary detection benchmark. We show mask AP on rare categories (APr), except for the SOTA ViT-based method (OWL-ViT), for which we show box AP.

Apart from evaluating region-level representation through open-vocabulary detection, we evaluate the image-level representation of RO-ViT on image-text retrieval using the MS-COCO and Flickr30K benchmarks. Our model with a 303M-parameter ViT outperforms the state-of-the-art CoCa model with a 1B-parameter ViT on MS COCO, and is on par on Flickr30K. This shows that our pre-training method improves not only the region-level representation but also the global image-level representation used for retrieval.

We show zero-shot image-text retrieval on MS COCO and Flickr30K benchmarks, and compare with dual-encoder methods. We report recall@1 (top-1 recall) on image-to-text (I2T) and text-to-image (T2I) retrieval tasks. RO-ViT outperforms the state-of-the-art CoCa with the same backbone.
RO-ViT open-vocabulary detection on LVIS. We only show the novel categories for clarity. RO-ViT detects many novel categories that it has never seen during detection training: “fishbowl”, “sombrero”, “persimmon”, “gargoyle”.

Visualization of positional embeddings

We visualize and compare the learned positional embeddings of RO-ViT with the baseline. Each tile is the cosine similarity between the positional embeddings of one patch and those of all other patches. For example, the tile in the top-left corner (marked in red) visualizes the similarity between the positional embedding of the location (row=1, column=1) and the positional embeddings of all other locations in 2D. The brightness of each tile indicates how close the learned positional embeddings of the two locations are. RO-ViT forms more distinct clusters at different patch locations, showing symmetrical global patterns around the center patch.

Each tile shows the cosine similarity between the positional embedding of the patch (at the indicated row-column position) and the positional embeddings of all other patches. ViT-B/16 backbone is used.
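The visualization itself reduces to a cosine-similarity computation over the learned positional embeddings; a small sketch follows (function and argument names are illustrative).

```python
# Sketch of the positional-embedding visualization: tile (i, j) holds the cosine
# similarity between the PE at patch (i, j) and the PEs at every location.
import torch
import torch.nn.functional as F

def pe_similarity_tiles(pos_embed, grid=14):
    """pos_embed: [grid*grid, dim] learned positional embeddings (e.g., ViT-B/16 at 224px)."""
    pe = F.normalize(pos_embed, dim=-1)
    sim = pe @ pe.t()                           # [grid*grid, grid*grid] cosine similarities
    return sim.reshape(grid, grid, grid, grid)  # index tiles by (row, col) of the query patch
```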

Conclusion

We present RO-ViT, a contrastive image-text pre-training framework that bridges the gap between image-level pre-training and open-vocabulary detection fine-tuning. Our methods are simple, scalable, and easy to apply to any contrastive backbone, with minimal computation overhead and no increase in parameters. RO-ViT achieves state-of-the-art results on the LVIS open-vocabulary detection benchmark and on image-text retrieval benchmarks, showing that the learned representation is not only beneficial at the region level but also highly effective at the image level. We hope this study can help research on open-vocabulary detection from the perspective of image-text pre-training, which can benefit both region-level and image-level tasks.


Acknowledgements

Dahun Kim, Anelia Angelova, and Weicheng Kuo conducted this work and are now at Google DeepMind. We would like to thank our colleagues at Google Research for their advice and helpful discussions.

Source: Google AI Blog


Google Ads API v12 sunset reminder

Google Ads API v12 will sunset on September 27, 2023. After this date, all v12 API requests will begin to fail. Please migrate to a newer version before September 27, 2023 to ensure your API access is unaffected.
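For client-library users, much of the migration comes down to pinning the client to a supported API version and re-testing your requests. As a hedged illustration using the Python client library (the version string, customer ID, and query below are placeholders, not a prescription):

```python
# Illustrative check with the google-ads Python client library: pin the client
# to a post-v12 API version. The version string and query are examples only.
from google.ads.googleads.client import GoogleAdsClient

# load_from_storage reads google-ads.yaml; "version" pins every service request.
client = GoogleAdsClient.load_from_storage(version="v14")
ga_service = client.get_service("GoogleAdsService")

query = "SELECT campaign.id, campaign.name FROM campaign LIMIT 10"
for batch in ga_service.search_stream(customer_id="INSERT_CUSTOMER_ID", query=query):
    for row in batch.results:
        print(row.campaign.id, row.campaign.name)
```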

We've prepared various resources to help you with the migration. In addition, using the Google Cloud Console, you can view the list of methods and services to which your project recently submitted requests:
  1. Open the Dashboard page (found under APIs & Services) in the Google Cloud Console.
  2. Click on Google Ads API in the table.
  3. On the METRICS subtab, you should see your recent requests plotted on each graph. At the bottom of the page, you’ll see the Methods table, where you can see which methods you’ve sent requests to. The method name includes a Google Ads API version, a service, and a method name, e.g., google.ads.googleads.v12.services.GoogleAdsService.Mutate. In this way, you can see all versions that you’ve used recently.
  4. (Optional) Click on the time frame at the top right of the page if you need to change it.
If you have questions while you’re upgrading, please reach out to us on the forum or through [email protected].

Latinitas connects the next generation of women leaders to high speed internet in Central Texas with support from Google Fiber

GFiber works with organizations across the country to connect more people to the benefits of quality internet. In Austin, Latinitas helps young women develop the necessary skills for living in a digital world. Communications Director Salwa Yordi shares their story.


Our mission at Latinitas is to empower all girls to innovate through media and technology. We provide in-person and virtual programming for students to express themselves, develop tech skills, learn about their culture, and discover their unique voice. Every summer the organization hosts “Club Latinitas” for kids ages eight to 18 to facilitate classes on digital creativity and coding. Without question, reliable high-speed internet is necessary in order to adequately support these children and their innovative projects.



As a Google Fiber Community Connection, Latinitas has relied on our Google Fiber service to provide enriching activities and educational resources that have been seamlessly integrated into our camp activities, including live streaming workshops with guest speakers from around the country. 


According to the National League of Cities, only 65% of Hispanic individuals living in the United States have access to a form of broadband connection, compared to white (80%) and Black (71%) individuals. In Texas, more than 15% of households lacked broadband in five Central Texas counties (Bastrop, Fayette, Lee, Caldwell and Mason), according to U.S. Census Bureau estimates from 2017 to 2021. In addition to the lack of access, there’s also a lack of digital skills within the Latino community. To address this need, Google Fiber has been an ongoing partner in funding Latinitas Digital Parents classes for the past few years.


For the first time, every camper at Latinitas has the ability to interact with industry professionals in real-time virtual sessions, participate in dynamic workshops and collaborate on multimedia projects without disruption due to connectivity. Additionally, the internet's reliability has also facilitated smooth online research, empowering kids to delve deeper into areas of interest and ultimately building up their confidence. 



Latinitas is committed to fostering a generation of tech-savvy leaders. This is a long-term commitment. Connecting our community ensures the mission and vision of our organization remain accessible to all, and we are thankful for partners like Google Fiber who are helping our future generations thrive.


Posted by Salwa Yordi, Communications Director at Latinitas     



Stable Channel Update for ChromeOS / ChromeOS Flex

 Hello All,


The Stable channel is being updated to 116.0.5845.120 (Platform version: 15509.63.0) for most ChromeOS devices and will be rolled out over the next few days.

If you find new issues, please let us know in one of the following ways:

Interested in switching channels? Find out how.

See release notes.

Security Fixes and Rewards:

VRP Reported Security Fixes:

Note: Access to bug details and links may be kept restricted until a majority of users are updated with a fix. We will also retain restrictions if the bug exists in a third party library that other projects similarly depend on, but haven’t yet fixed.

[$TBD] [1464456] Medium CVE-2023-4369 XSS on ChromeOS, abusable by extensions. Reported by Derin Eryilmaz.

[$TBD] [1443214] Low CVE-TBD Extension abuse in ChromeOS. Reported by Allen Ding.



3rd Party Reported Security Fixes:


[NA]  [NA] High Fixes CVE-2023-20593 on impacted AMD platforms

[NA]  [NA] High Fixes CVE-2023-4211 on impacted Arm platforms

[NA]  [NA] High Fixes CVE-2023-4128 in Linux Kernel

[NA]  [NA] High Fixes CVE-2023-4147 in Linux Kernel

[NA]  [NA] High Fixes CVE-2023-3390 in Linux Kernel

[NA]  [NA] High Fixes CVE-2023-32804 in Arm Mali Driver Development Kit


Chrome Browser Security Fixes:


[$30000][1448548] High CVE-2023-2312: Use after free in Offline. Reported by avaue at S.S.L. on 2023-05-24

[$5000][1458303] High CVE-2023-4349: Use after free in Device Trust Connectors. Reported by Weipeng Jiang (@Krace) of VRI on 2023-06-27

[$3000][1454817] High CVE-2023-4350: Inappropriate implementation in Fullscreen. Reported by Khiem Tran (@duckhiem) on 2023-06-14

[$2000][1465833] High CVE-2023-4351: Use after free in Network. Reported by Guang and Weipeng Jiang of VRI on 2023-07-18

[$NA][1452076] High CVE-2023-4352: Type Confusion in V8. Reported by Sergei Glazunov of Google Project Zero on 2023-06-07

[$NA][1458046] High CVE-2023-4353: Heap buffer overflow in ANGLE. Reported by Christoph Diehl / Microsoft Vulnerability Research on 2023-06-27

[$NA][1464215] High CVE-2023-4354: Heap buffer overflow in Skia. Reported by Mark Brand of Google Project Zero on 2023-07-12

[$NA][1468943] High CVE-2023-4355: Out of bounds memory access in V8. Reported by Sergei Glazunov of Google Project Zero on 2023-07-31

[$5000][1449929] Medium CVE-2023-4356: Use after free in Audio. Reported by Zhenghang Xiao (@Kipreyyy) on 2023-05-30

[$3000][1458911] Medium CVE-2023-4357: Insufficient validation of untrusted input in XML. Reported by Igor Sak-Sakovskii on 2023-06-28

[$3000][1466415] Medium CVE-2023-4358: Use after free in DNS. Reported by Weipeng Jiang (@Krace) of VRI on 2023-07-20

[$2000][1443722] Medium CVE-2023-4359: Inappropriate implementation in App Launcher. Reported by @retsew0x01 on 2023-05-09

[$2000][1462723] Medium CVE-2023-4360: Inappropriate implementation in Color. Reported by Axel Chong on 2023-07-07

[$2000][1465230] Medium CVE-2023-4361: Inappropriate implementation in Autofill. Reported by Thomas Orlita on 2023-07-17

[$1000][1316379] Medium CVE-2023-4362: Heap buffer overflow in Mojom IDL. Reported by Zhao Hai of NanJing Cyberpeace TianYu Lab on 2022-04-14

[$1000][1367085] Medium CVE-2023-4363: Inappropriate implementation in WebShare. Reported by Alesandro Ortiz on 2022-09-23

[$1000][1406922] Medium CVE-2023-4364: Inappropriate implementation in Permission Prompts. Reported by Jasper Rebane on 2023-01-13

[$1000][1431043] Medium CVE-2023-4365: Inappropriate implementation in Fullscreen. Reported by Hafiizh on 2023-04-06

[$1000][1450784] Medium CVE-2023-4366: Use after free in Extensions. Reported by asnine on 2023-06-02

[$500][1467743] Medium CVE-2023-4367: Insufficient policy enforcement in Extensions API. Reported by Axel Chong on 2023-07-26

[$500][1467751] Medium CVE-2023-4368: Insufficient policy enforcement in Extensions API. Reported by Axel Chong on 2023-07-26



Android Runtime Container Security Fixes:

[NA]  [NA] High Fixes CVE-2023-21264 on impacted platforms

[NA]  [NA] High Fixes CVE-2020-29374 on impacted platforms



We would like to thank the security researchers that report vulnerabilities to us via bughunters.google.com to keep ChromeOS and the entire open source ecosystem secure.


Google ChromeOS

Google Workspace Updates Weekly Recap – August 25, 2023

2 New updates 

Unless otherwise indicated, the features below are available to all Google Workspace customers, and are fully launched or in the process of rolling out. Rollouts should take no more than 15 business days to complete if launching to both Rapid and Scheduled Release at the same time. If not, each stage of rollout should take no more than 15 business days to complete.


Copy space member email address in Google Chat 
Space managers and members can now copy the email addresses of members in a space on Google Chat. This option is ON by default for spaces with 100 members or less. The option to copy space member email addresses will be disabled in spaces with 100+ members. 

Filter by expression for Connected Sheets for Looker 
You can now use common filter expressions from Looker such as “last 30 days”, “last quarter”, or “NOT 50” to filter on pivot tables in Connected Sheets for Looker. | Learn more about Connected Sheets


Previous announcements

The announcements below were published on the Workspace Updates blog earlier this week. Please refer to the original blog posts for complete details.


View speaker notes while co-presenting Google Slides in Google Meet
Co-presenters are now also able to view speaker notes. | Available to Google Workspace Business Standard, Business Plus, Enterprise Starter, Enterprise Essentials, Enterprise Standard, Enterprise Plus, Education Plus, the Teaching & Learning Upgrade, and Workspace Individual customers only. | Learn more about viewing speaker notes while co-presenting Google Slides in Google Meet. 

Stronger protection for additional sensitive actions taken in Gmail 
Last year, we introduced stronger safeguards around sensitive actions taken in your Google Workspace accounts. We’re extending these protections to sensitive actions taken in Gmail. | Learn more about stronger protection in Gmail

See message view counts in Google Chat spaces 
Space members can now see view counts for messages in all spaces. | Learn more about view counts in Google Chat spaces

View & compare script versions with Apps Script project history 
We're announcing project history, a new interface for developers to view previously deployed script versions and compare versions to the current script version. | Learn more about Apps Script project history. 

Displaying Microsoft Outlook users as organizers in Google Calendar 
Microsoft Outlook users who organize meetings are now listed among the other meeting attendees in Calendar as the meeting organizer. | Learn more about Microsoft Outlook organizers in Google Calendar. 

Introducing Workday app for Google Chat 
We’re adding a new Workday app for Google Chat that allows you to perform quick actions in Workday, such as requesting time off, filing expense reports and looking up a colleague's information, all without leaving Google Chat. | Learn more about the Workday app for Google Chat.

Join client-side encrypted meetings from your mobile device 
You can join a client-side encrypted meeting directly from the Google Meet and Calendar apps. | Available to Google Workspace Enterprise Plus, Education Standard, and Education Plus customers hosting client-side encrypted calls only. | Learn more about client-side encrypted meetings from your mobile device

Third-party app access enhancements for Google Workspace for Education 
All Google Workspace for Education Admins must review and confirm access settings for third-party configured apps that are currently accessible to your users by Oct 23, 2023 in order for users designated as under 18 to maintain access to those third-party apps. | Available to Education Fundamentals, Education Standard, Education Plus, and the Teaching and Learning Upgrade only. | Learn more about third-party app access. 

Enhance your Google Keep notes on Android with rich text formatting 
We’re adding rich text formatting options to new notes on Keep. This highly requested feature enables you to customize and add emphasis to your text through bolding, underlining, italicizing, and heading styles. | Learn more about rich text formatting on the Keep app.


Completed rollouts

The features below completed their rollouts to Rapid Release domains, Scheduled Release domains, or both. Please refer to the original blog posts for additional details.


Rapid Release Domains:

For a recap of announcements in the past six months, check out What’s new in Google Workspace (recent releases).

Responsible AI at Google Research: Perception Fairness

Google’s Responsible AI research is built on a foundation of collaboration — between teams with diverse backgrounds and expertise, between researchers and product developers, and ultimately with the community at large. The Perception Fairness team drives progress by combining deep subject-matter expertise in both computer vision and machine learning (ML) fairness with direct connections to the researchers building the perception systems that power products across Google and beyond. Together, we are working to intentionally design our systems to be inclusive from the ground up, guided by Google’s AI Principles.

Perception Fairness research spans the design, development, and deployment of advanced multimodal models including the latest foundation and generative models powering Google's products.

Our team's mission is to advance the frontiers of fairness and inclusion in multimodal ML systems, especially related to foundation models and generative AI. This encompasses core technology components including classification, localization, captioning, retrieval, visual question answering, text-to-image or text-to-video generation, and generative image and video editing. We believe that fairness and inclusion can and should be top-line performance goals for these applications. Our research is focused on unlocking novel analyses and mitigations that enable us to proactively design for these objectives throughout the development cycle. We answer core questions, such as: How can we use ML to responsibly and faithfully model human perception of demographic, cultural, and social identities in order to promote fairness and inclusion? What kinds of system biases (e.g., underperforming on images of people with certain skin tones) can we measure and how can we use these metrics to design better algorithms? How can we build more inclusive algorithms and systems and react quickly when failures occur?


Measuring representation of people in media

ML systems that can edit, curate or create images or videos can affect anyone exposed to their outputs, shaping or reinforcing the beliefs of viewers around the world. Research to reduce representational harms, such as reinforcing stereotypes or denigrating or erasing groups of people, requires a deep understanding of both the content and the societal context. It hinges on how different observers perceive themselves, their communities, or how others are represented. There's considerable debate in the field regarding which social categories should be studied with computational tools and how to do so responsibly. Our research focuses on working toward scalable solutions that are informed by sociology and social psychology, are aligned with human perception, embrace the subjective nature of the problem, and enable nuanced measurement and mitigation. One example is our research on differences in human perception and annotation of skin tone in images using the Monk Skin Tone scale.

Our tools are also used to study representation in large-scale content collections. Through our Media Understanding for Social Exploration (MUSE) project, we've partnered with academic researchers, nonprofit organizations, and major consumer brands to understand patterns in mainstream media and advertising content. We first published this work in 2017, with a co-authored study analyzing gender equity in Hollywood movies. Since then, we've increased the scale and depth of our analyses. In 2019, we released findings based on over 2.7 million YouTube advertisements. In the latest study, we examine representation across intersections of perceived gender presentation, perceived age, and skin tone in over twelve years of popular U.S. television shows. These studies provide insights for content creators and advertisers and further inform our own research.

An illustration (not actual data) of computational signals that can be analyzed at scale to reveal representational patterns in media collections. [Video Collection / Getty Images]

Moving forward, we're expanding the ML fairness concepts on which we focus and the domains in which they are responsibly applied. Looking beyond photorealistic images of people, we are working to develop tools that model the representation of communities and cultures in illustrations, abstract depictions of humanoid characters, and even images with no people in them at all. Finally, we need to reason about not just who is depicted, but how they are portrayed — what narrative is communicated through the surrounding image content, the accompanying text, and the broader cultural context.


Analyzing bias properties of perceptual systems

Building advanced ML systems is complex, with multiple stakeholders informing various criteria that decide product behavior. Overall quality has historically been defined and measured using summary statistics (like overall accuracy) over a test dataset as a proxy for user experience. But not all users experience products in the same way.

Perception Fairness enables practical measurement of nuanced system behavior beyond summary statistics, and makes these metrics core to the system quality that directly informs product behaviors and launch decisions. This is often much harder than it seems. Distilling complex bias issues (e.g., disparities in performance across intersectional subgroups or instances of stereotype reinforcement) to a small number of metrics without losing important nuance is extremely challenging. Another challenge is balancing the interplay between fairness metrics and other product metrics (e.g., user satisfaction, accuracy, latency), which are often phrased as conflicting despite being compatible. It is common for researchers to describe their work as optimizing an "accuracy-fairness" tradeoff when in reality widespread user satisfaction is aligned with meeting fairness and inclusion objectives.

We built and released the MIAP dataset as part of Open Images, leveraging our research on perception of socially relevant concepts and detection of biased behavior in complex systems to create a resource that furthers ML fairness research in computer vision. Original photo credits — left: Boston Public Library; middle: jen robinson; right: Garin Fons; all used with permission under the CC BY 2.0 license.

To these ends, our team focuses on two broad research directions. First, democratizing access to well-understood and widely-applicable fairness analysis tooling, engaging partner organizations in adopting them into product workflows, and informing leadership across the company in interpreting results. This work includes developing broad benchmarks, curating widely-useful high-quality test datasets and tooling centered around techniques such as sliced analysis and counterfactual testing — often building on the core representation signals work described earlier. Second, advancing novel approaches towards fairness analytics — including partnering with product efforts that may result in breakthrough findings or inform launch strategy.
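As a toy illustration of the sliced-analysis pattern mentioned above, the snippet below computes per-subgroup accuracy and its gap from the overall accuracy; the column names and metric choice are hypothetical and not tied to any Google tool.

```python
# Toy sketch of sliced analysis: per-subgroup accuracy and its gap from the
# overall accuracy. Column names and the metric choice are hypothetical.
import pandas as pd

def sliced_accuracy(df: pd.DataFrame, group_col: str,
                    label_col: str = "label", pred_col: str = "prediction") -> pd.DataFrame:
    overall = (df[label_col] == df[pred_col]).mean()
    per_group = (
        df.assign(correct=(df[label_col] == df[pred_col]))
          .groupby(group_col)["correct"].mean()
          .rename("accuracy").to_frame()
    )
    per_group["gap_vs_overall"] = per_group["accuracy"] - overall
    return per_group.sort_values("gap_vs_overall")  # worst-performing slices first
```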


Advancing AI responsibly

Our work does not stop with analyzing model behavior. Rather, we use this as a jumping-off point for identifying algorithmic improvements in collaboration with other researchers and engineers on product teams. Over the past year we've launched upgraded components that power Search and Memories features in Google Photos, leading to more consistent performance and drastically improving robustness through added layers that keep mistakes from cascading through the system. We are working on improving ranking algorithms in Google Images to diversify representation. We updated algorithms that may reinforce historical stereotypes, using additional signals responsibly, such that it’s more likely for everyone to see themselves reflected in Search results and find what they're looking for.

This work naturally carries over to the world of generative AI, where models can create collections of images or videos seeded from image and text prompts and can answer questions about images and videos. We're excited about the potential of these technologies to deliver new experiences to users and as tools to further our own research. To enable this, we're collaborating across the research and responsible AI communities to develop guardrails that mitigate failure modes. We’re leveraging our tools for understanding representation to power scalable benchmarks that can be combined with human feedback, and investing in research from pre-training through deployment to steer the models to generate higher quality, more inclusive, and more controllable output. We want these models to inspire people, producing diverse outputs, translating concepts without relying on tropes or stereotypes, and providing consistent behaviors and responses across counterfactual variations of prompts.


Opportunities and ongoing work

Despite over a decade of focused work, the field of perception fairness technologies still seems like a nascent and fast-growing space, rife with opportunities for breakthrough techniques. We continue to see opportunities to contribute technical advances backed by interdisciplinary scholarship. The gap between what we can measure in images and the underlying aspects of human identity and expression is large; closing this gap will require increasingly complex media analytics solutions. Data metrics that indicate true representation, situated in the appropriate context and heeding a diversity of viewpoints, remain an open challenge for us. Can we reach a point where we can reliably identify depictions of nuanced stereotypes, continually update them to reflect an ever-changing society, and discern situations in which they could be offensive? Algorithmic advances driven by human feedback point to a promising path forward.

Recent focus on AI safety and ethics in the context of modern large model development has spurred new ways of thinking about measuring systemic biases. We are exploring multiple avenues to use these models — along with recent developments in concept-based explainability methods, causal inference methods, and cutting-edge UX research — to quantify and minimize undesired biased behaviors. We look forward to tackling the challenges ahead and developing technology that is built for everybody.


Acknowledgements

We would like to thank every member of the Perception Fairness team, and all of our collaborators.

Source: Google AI Blog