Pathdreamer: A World Model for Indoor Navigation

When a person navigates around an unfamiliar building, they take advantage of many visual, spatial and semantic cues to help them efficiently reach their goal. For example, even in an unfamiliar house, if they see a dining area, they can make intelligent predictions about the likely location of the kitchen and lounge areas, and therefore the expected location of common household objects. For robotic agents, taking advantage of semantic cues and statistical regularities in novel buildings is challenging. A typical approach is to implicitly learn what these cues are, and how to use them for navigation tasks, in an end-to-end manner via model-free reinforcement learning. However, cues learned in this way are expensive to acquire, hard to inspect, and difficult to re-use in another agent without learning again from scratch.

People navigating in unfamiliar buildings can take advantage of visual, spatial and semantic cues to predict what’s around a corner. A computational model with this capability is a visual world model.

An appealing alternative for robotic navigation and planning agents is to use a world model to encapsulate rich and meaningful information about their surroundings, which enables an agent to make specific predictions about actionable outcomes within their environment. Such models have seen widespread interest in robotics, simulation, and reinforcement learning with impressive results, including finding the first known solution for a simulated 2D car racing task, and achieving human-level performance in Atari games. However, game environments are still relatively simple compared to the complexity and diversity of real-world environments.

In “Pathdreamer: A World Model for Indoor Navigation”, published at ICCV 2021, we present a world model that generates high-resolution 360º visual observations of areas of a building unseen by an agent, using only limited seed observations and a proposed navigation trajectory. As illustrated in the video below, the Pathdreamer model can synthesize an immersive scene from a single viewpoint, predicting what an agent might see if it moved to a new viewpoint or even a completely unseen area, such as around a corner. Beyond potential applications in video editing and bringing photos to life, solving this task promises to codify knowledge about human environments to benefit robotic agents navigating in the real world. For example, a robot tasked with finding a particular room or object in an unfamiliar building could perform simulations using the world model to identify likely locations before physically searching anywhere. World models such as Pathdreamer can also be used to increase the amount of training data for agents, by training agents in the model.

Provided with just a single observation (RGB, depth, and segmentation) and a proposed navigation trajectory as input, Pathdreamer synthesizes high resolution 360º observations up to 6-7 meters away from the original location, including around corners. For more results, please refer to the full video.

How Does Pathdreamer Work?
Pathdreamer takes as input a sequence of one or more previous observations, and generates predictions for a trajectory of future locations, which may be provided up front or iteratively by the agent interacting with the returned observations. Both inputs and predictions consist of RGB, semantic segmentation, and depth images. Internally, Pathdreamer uses a 3D point cloud to represent surfaces in the environment. Points in the cloud are labelled with both their RGB color value and their semantic segmentation class, such as wall, chair or table.

To predict visual observations in a new location, the point cloud is first re-projected into 2D at the new location to provide ‘guidance’ images, from which Pathdreamer generates realistic high-resolution RGB, semantic segmentation and depth. As the model ‘moves’, new observations (either real or predicted) are accumulated in the point cloud. One advantage of using a point cloud for memory is temporal consistency — revisited regions are rendered in a consistent manner to previous observations.

Internally, Pathdreamer represents surfaces in the environment via a 3D point cloud containing both semantic labels (top) and RGB color values (bottom). To generate a new observation, Pathdreamer ‘moves’ through the point cloud to the new location and uses the re-projected point cloud image for guidance.
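As a rough illustration of this guidance step, the sketch below re-projects a labelled point cloud into 2D guidance images at a new camera pose using a simple pinhole model and a z-buffer. It is a minimal stand-in for the geometry involved, not the released Pathdreamer code; the function name, camera intrinsics, and array layout are all illustrative assumptions.

```python
import numpy as np

def reproject_point_cloud(points, colors, labels, pose, fx, fy, cx, cy, h, w):
    """Project a labelled 3D point cloud into 2D guidance images at a new pose.

    points: (N, 3) world coordinates; colors: (N, 3) uint8 RGB;
    labels: (N,) semantic class ids; pose: 4x4 world-to-camera transform.
    Returns RGB, semantic, and depth guidance images; pixels hit by no
    points remain zero (the unseen regions the generator must fill in).
    """
    # Transform points into the camera frame of the new viewpoint.
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    cam = (pose @ pts_h.T).T[:, :3]

    # Keep points in front of the camera and project with a pinhole model.
    front = cam[:, 2] > 1e-6
    cam, col, lab = cam[front], colors[front], labels[front]
    u = np.round(fx * cam[:, 0] / cam[:, 2] + cx).astype(int)
    v = np.round(fy * cam[:, 1] / cam[:, 2] + cy).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z = u[inside], v[inside], cam[inside, 2]
    col, lab = col[inside], lab[inside]

    rgb = np.zeros((h, w, 3), dtype=np.uint8)
    sem = np.zeros((h, w), dtype=np.int32)
    depth = np.full((h, w), np.inf)

    # Z-buffer: draw far points first so nearer points overwrite them.
    order = np.argsort(-z)
    rgb[v[order], u[order]] = col[order]
    sem[v[order], u[order]] = lab[order]
    depth[v[order], u[order]] = z[order]
    return rgb, sem, depth
```

Because predicted observations are back-projected into the same cloud, running this step on a revisited location reproduces what was rendered before, which is the temporal-consistency property described above.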

To convert guidance images into plausible, realistic outputs, Pathdreamer operates in two stages: the first stage, the structure generator, creates segmentation and depth images, and the second stage, the image generator, renders these into RGB outputs. Conceptually, the first stage provides a plausible high-level semantic representation of the scene, and the second stage renders this into a realistic color image. Both stages are based on convolutional neural networks.

Pathdreamer operates in two stages: the first stage, the structure generator, creates segmentation and depth images, and the second stage, the image generator, renders these into RGB outputs. The structure generator is conditioned on a noise variable to enable the model to synthesize diverse scenes in areas of high uncertainty.

Diverse Generation Results
In regions of high uncertainty, such as an area predicted to be around a corner or in an unseen room, many different scenes are possible. Incorporating ideas from stochastic video generation, the structure generator in Pathdreamer is conditioned on a noise variable, which represents the stochastic information about the next location that is not captured in the guidance images. By sampling multiple noise variables, Pathdreamer can synthesize diverse scenes, allowing an agent to sample multiple plausible outcomes for a given trajectory. These diverse outputs are reflected not only in the first stage outputs (semantic segmentation and depth images), but in the generated RGB images as well.

Pathdreamer is capable of generating multiple diverse and plausible images for regions of high uncertainty. Guidance images on the leftmost column represent pixels that were previously seen by the agent. Black pixels represent regions that were previously unseen, for which Pathdreamer renders diverse outputs by sampling multiple random noise vectors. In practice, the generated output can be informed by new observations as the agent navigates the environment.
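A minimal sketch of this sampling loop is below. The structure generator here is a stand-in passed in by the caller (the real stage-one network is a learned convolutional model), and the function name and noise dimension are illustrative assumptions.

```python
import numpy as np

def sample_diverse_completions(guidance_sem, structure_gen, noise_dim=32,
                               num_samples=4, seed=0):
    """Draw several scene hypotheses for the same guidance image.

    Each sample feeds a different Gaussian noise vector to the structure
    generator, so unseen regions (guidance value 0) can be completed
    differently across samples while visible regions stay pinned to the
    guidance. `structure_gen` stands in for the learned stage-1 network.
    """
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(num_samples):
        z = rng.standard_normal(noise_dim)  # stochastic scene information
        samples.append(structure_gen(guidance_sem, z))
    return samples
```

An agent can then score or rank these hypotheses, or simply treat them as alternative rollouts of the same trajectory.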

Pathdreamer is trained with images and 3D environment reconstructions from Matterport3D, and is capable of synthesizing realistic images as well as continuous video sequences. Because the output imagery is high-resolution and 360º, it can be readily converted for use by existing navigation agents for any camera field of view. For more details and to try out Pathdreamer yourself, we recommend taking a look at our open source code.

Application to Visual Navigation Tasks
As a visual world model, Pathdreamer shows strong potential to improve performance on downstream tasks. To demonstrate this, we apply Pathdreamer to the task of Vision-and-Language Navigation (VLN), in which an embodied agent must follow a natural language instruction to navigate to a location in a realistic 3D environment. Using the Room-to-Room (R2R) dataset, we conduct an experiment in which an instruction-following agent plans ahead by simulating many possible navigable trajectories through the environment, ranking each against the navigation instructions, and choosing the best-ranked trajectory to execute. Three settings are considered. In the Ground-Truth setting, the agent plans by interacting with the actual environment, i.e. by moving. In the Baseline setting, the agent plans ahead without moving by interacting with a navigation graph that encodes the navigable routes within the building, but does not provide any visual observations. In the Pathdreamer setting, the agent plans ahead without moving by interacting with the navigation graph and also receives corresponding visual observations generated by Pathdreamer.
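Schematically, the look-ahead planning in the Pathdreamer setting can be sketched as follows, where `world_model` and `scorer` are stand-ins for Pathdreamer and the instruction-following agent's ranking model; the function names and graph format are illustrative assumptions, not the paper's implementation.

```python
def plan_with_world_model(start_obs, nav_graph, instruction,
                          world_model, scorer, horizon=3):
    """Look-ahead planning in imagination (schematic).

    Enumerate navigable trajectories up to `horizon` steps using the
    navigation graph, roll each out in the world model to get predicted
    observations, score the (instruction, observations) pair with the
    agent's ranking model, and return the best trajectory to execute.
    """
    def expand(node, depth):
        # All paths of length `depth` starting at `node`.
        if depth == 0:
            return [[node]]
        return [[node] + rest
                for nxt in nav_graph[node]
                for rest in expand(nxt, depth - 1)]

    best_traj, best_score = None, float("-inf")
    for traj in expand(start_obs["node"], horizon):
        imagined = world_model(start_obs, traj)  # predicted observations
        score = scorer(instruction, imagined)
        if score > best_score:
            best_traj, best_score = traj, score
    return best_traj, best_score
```

The key point is that every rollout happens inside the model, so only the single best-ranked trajectory needs to be executed physically.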

When planning ahead for three steps (approximately 6m), in the Pathdreamer setting the VLN agent achieves a navigation success rate of 50.4%, significantly higher than the 40.6% success rate in the Baseline setting without Pathdreamer. This suggests that Pathdreamer encodes useful and accessible visual, spatial and semantic knowledge about real-world indoor environments. As an upper bound illustrating the performance of a perfect world model, under the Ground-Truth setting (planning by moving) the agent’s success rate is 59%, although we note that this setting requires the agent to expend significant time and resources to physically explore many trajectories, which would likely be prohibitively costly in a real-world setting.

We evaluate several planning settings for an instruction-following agent using the Room-to-Room (R2R) dataset. Planning ahead using a navigation graph with corresponding visual observations synthesized by Pathdreamer (Pathdreamer setting) is more effective than planning ahead using the navigation graph alone (Baseline setting), capturing around half the benefit of planning ahead using a world model that perfectly matches reality (Ground-Truth setting).

Conclusions and Future Work
These results showcase the promise of using world models such as Pathdreamer for complicated embodied navigation tasks. We hope that Pathdreamer will help unlock model-based approaches to challenging embodied navigation tasks such as navigating to specified objects and VLN.

Applying Pathdreamer to other embodied navigation tasks such as Object-Nav, continuous VLN, and street-level navigation is a natural direction for future work. We also envision further research on improved architecture and modeling directions for the Pathdreamer model, as well as testing it on more diverse datasets, including but not limited to outdoor environments. To explore Pathdreamer in more detail, please visit our GitHub repository.

This project is a collaboration with Jason Baldridge, Honglak Lee, and Yinfei Yang. We thank Austin Waters, Noah Snavely, Suhani Vora, Harsh Agrawal, David Ha, and others who provided feedback throughout the project. We are also grateful for general support from Google Research teams. Finally, we thank Tom Small for creating the animation in the third figure.

Source: Google AI Blog

Magic in the making: The 4 pillars of great creative

Consumers report that helpfulness is their top expectation of brands since the start of the COVID-19 pandemic, with 78% saying a brand's advertising should show how it can be helpful in everyday life.1 This means businesses need to quickly engage audiences with meaningful messages, using immersive storytelling to bring their brand and products to life.

To help you build visually-rich ad experiences that easily drive consumers to action, we've brought together our top creative guidance across Google Ads solutions in a single guide. Learn to craft stronger calls-to-action, engaging ad copy and striking visual assets — plus, get the latest insights from our team of creative and data scientists at Creative Works. You can also explore tips by marketing objective in order to craft more impactful creative to meet your business goals.

An image of two phones featuring natural soap products.

Dr. Squatch using a clear call-to-action, engaging copy and rich product visuals with Google Ads.

The 4 pillars of compelling creative

Lead with a clear call-to-action: Personalized descriptions perform up to two times better for their campaign goal versus non-personalized descriptions.2 This means businesses need to help consumers immediately see what they have to offer by including words like "you" to draw attention, and adding their product or brand name in headlines and descriptions.

Connect more authentically with a wide variety of assets: Audiences take action faster if they can relate to your message — 64% of consumers said they took some sort of action after seeing an ad that they considered to be diverse or inclusive.3 And images that feature people perform over 30% better for their campaign goal versus images that don’t.4 Given the variety of consumers looking online for new products to try, brands should show a wide range of people using their products or services to resonate with audiences.

Build for smaller screens: Images with no overlaid text, or overlaid text under 20 characters, perform up to 1.2X better for their campaign goal versus images with longer overlaid text.5 With people spending more time on a broad range of small devices, businesses should consider how and where consumers are seeing their ads and provide visual assets that clearly communicate their call-to-action.

Give your creatives time to test: We've seen that waiting 2-3 weeks between changes to ad creative minimizes performance fluctuations, allowing the Google Ads system time to learn and adapt to your most effective assets. Review Ad strength and asset reporting to better understand which assets resonate best and help you make the call on which to remove or replace.

An image of two phones featuring beauty products.

Beauty brand COSMEDIX using a variety of image assets in multiple aspect ratios with Google Ads.

Get help with building better assets

Consumers expect businesses of all sizes to offer more helpful brand experiences. Stand out with more relevant, engaging offers with help from our new guide to building better creative. And for more support with developing new creative assets or campaign strategies, check out our approved creative production agencies to find the right partner to help you achieve your business goals.

1. Kantar, COVID-19 Barometer Global Report, Wave2, runs across 50 countries, n=9,815, fielded 27th-30th March 2020.

2. Google internal data based on an aggregate study of median performance of campaign goals for Responsive display ads (CTR), Discovery ads (CTR, CVR), Video action campaigns (VTR) and Video discovery ads (VTR) across 78K assets for Media & Entertainment, Retail, and Finance verticals. Global. January 2020 - June 2021.

3. Google/Ipsos, U.S., Inclusive Marketing Study, n of 2,987 U.S. consumers ages 13–54 who access the internet at least monthly, Aug. 2019.

4. Google internal data based on an aggregate study of median performance of campaign goals for Discovery ads (CTR, CVR), Video action campaigns (VTR), Video discovery ads (VTR), App campaigns for installs (IPM), and App campaigns for engagement (EPM) for Media & Entertainment, Retail, and Finance verticals. Global. January 2020 - June 2021.

5. Google internal data based on an aggregate study of median performance of campaign goals for Discovery ads (CTR, CVR), and Responsive display ads (CTR) across 78K assets for Media & Entertainment, Retail, and Finance verticals. Global. January 2020 - June 2021.

Open Source in the 2021 Accelerate State of DevOps Report

To truly thrive, organizations need to adopt practices and capabilities that will lead them to performance improvements. Therefore, having access to data-driven insights and recommendations about the most effective and efficient ways to develop and deliver technology is critical. Over the past seven years, the DevOps Research and Assessment (DORA) program has collected data from more than 32,000 industry professionals and used rigorous statistical analysis to deepen our understanding of the practices that lead to excellence in technology delivery and to powerful business outcomes.
One of the most valuable insights that has come from this research is the categorization of organizations into four different performance profiles (Elite, High, Medium, and Low) based on their performance on four software delivery metrics centered around throughput and stability: Deployment Frequency, Lead Time for Changes, Time to Restore Service, and Change Failure Rate. We found that organizations that excel at these four metrics can be classified as elite performers while those that do not can be classified as low performers. See DevOps Research and Assessment (DORA) for a detailed description of these metrics and the different levels of organizational performance.
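As a rough illustration, an organization could be bucketed on the four metrics as in the sketch below. The cutoffs are illustrative assumptions loosely inspired by the bands the reports describe; the report itself derives the profiles from cluster analysis, not fixed thresholds.

```python
def performance_profile(deploys_per_day, lead_time_hours,
                        restore_hours, change_failure_rate):
    """Bucket an organization using the four DORA delivery metrics.

    Thresholds are illustrative, not the report's methodology: roughly,
    elite teams deploy on demand with changes and restores landing within
    an hour, while lower bands relax each cutoff by about an order of
    magnitude.
    """
    if (deploys_per_day >= 1 and lead_time_hours <= 1
            and restore_hours <= 1 and change_failure_rate <= 0.15):
        return "Elite"
    if (deploys_per_day >= 1 / 7 and lead_time_hours <= 24 * 7
            and restore_hours <= 24 and change_failure_rate <= 0.30):
        return "High"
    if (deploys_per_day >= 1 / 30 and lead_time_hours <= 24 * 30
            and restore_hours <= 24 * 7):
        return "Medium"
    return "Low"
```

The useful property of a scheme like this is that all four metrics must be strong at once: excelling at throughput while failing on stability (or vice versa) does not reach the top band.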

We have found that a number of technical capabilities are associated with improved continuous delivery performance. Our findings indicate that organizations that have incorporated loosely coupled architecture, continuous testing and integration, trunk-based development, deployment automation, database change management, and monitoring and observability, and have leveraged open source technologies, perform better than organizations that have not adopted these capabilities.

Now that you know a little bit about what DORA is and some of its key findings, let’s dive into whether the use of open source technologies within organizations impacts performance.

A quick Google search will yield hundreds (if not thousands) of articles describing the myriad ways organizations benefit from using open source software: faster innovation, higher quality products, stronger security, flexibility, ease of customization, and more. We know using open source software is the way to go, but until now we had little empirical evidence demonstrating that its use is associated with improved organizational performance.

This year, we surveyed 1,200 working professionals from a variety of industries around the globe about the factors that drive higher performance, including the use of open source software. Research from this year’s DORA report illustrates that low performing organizations have the highest use of proprietary software. In contrast, elite performers are 1.75 times more likely to make extensive use of open source components, libraries, and platforms. We also find that elite performers are 1.5 times more likely to have plans to expand their use of open source software compared to their low-performing counterparts. But the question remains: does leveraging open source software impact an organization’s performance? It turns out the answer is yes!

Our research also found that elite performers who meet their reliability targets are 2.4 times more likely to leverage open source technologies. We suspect that the original tenets of the open source movement, transparency and collaboration, play a big role. Developers are less likely to waste time reinventing the wheel, which frees them to spend more time innovating, and they can leverage global talent instead of relying on the few people in their team or organization.

Technology transformations take time, effort, and resources. They also require organizations to make significant mental shifts. These shifts are easier when there is empirical evidence backing recommendations—organizations don’t have to take someone’s word for it, they can look at the data, look at the consistency of findings to know that success and improvement are in fact possible.

In addition to open source software, the 2021 Accelerate State of DevOps Report discusses a variety of capabilities and practices that drive performance. In the 2021 report, we also examined the effects of SRE best practices, the pandemic and burnout, the importance of quality documentation, and we revisited our exploration of leveraging the cloud. If you’d like to read the full report or any previous report, you can visit

Giving users more transparency into their Google ad experience

Today, people engage with a wider variety of ad formats on more Google products than ever before — from Video ads on YouTube to Shopping ads across Search, Display and more. And they increasingly want to know more about the ads they see. That’s why we’ve been innovating on features like “About this ad” to help users understand why an ad was shown, and to mute ads or advertisers they aren’t interested in.

Last spring, we also introduced an advertiser identity verification program that requires Google advertisers to verify information about their businesses, where they operate from and what they’re selling or promoting. This transparency helps users learn more about the company behind a specific ad. It also helps differentiate credible advertisers in the ecosystem, while limiting the ability of bad actors to misrepresent themselves. Since launching the program last year, we have started verifying advertisers in more than 90 countries — and we’re not stopping there.

Introducing advertiser pages 

To give users of our products even more transparency, we are enhancing ad disclosures with new advertiser pages. Users can access these disclosures in our new “About this ad” menu to see the ads a specific verified advertiser has run over the past 30 days. For example, imagine you’re seeing an ad for a coat you’re interested in, but you don’t recognize the brand. With advertiser pages, you can learn more about that advertiser before visiting their site or making a purchase.

Users can tap on an ad to learn more about the advertiser showing them the ad

In addition to learning about the ads and advertiser, users can more easily report an ad if they believe it violates one of our policies. When an ad is reported, a member of our team reviews it for compliance with our policies and will take it down if appropriate. Creating a safe experience is a top priority for us, and user feedback is an important part of how we do that.

Advertiser pages will launch in the coming months in the United States, and will roll out in phases to more countries in 2022. We will also continue to explore how to share additional data within advertiser pages over time.

Improving transparency for ads on Google

Enhanced ad disclosures build on our efforts to create a clear and intuitive experience for users who engage with ads on Google products. More than 30 million users interact with our ads transparency and control menus every day, and “About this ad” has received positive feedback on its streamlined experience. Users engage with our ads transparency and control tools on YouTube more than on any other Google product. To help our users make informed decisions online — no matter where they engage — we will roll out the “About this ad” feature to YouTube and Search in the coming months.

We're committed to creating a trustworthy Google ad experience, and enhanced ad disclosures represent the next step in that journey. We will continue to work towards helping our users have greater control and understanding over the ads they see.

Easily chat with meeting participants from a Google Calendar event

What’s changing 

We’re adding an option that makes it easy to chat with meeting attendees directly from Google Calendar. Within the Calendar event on web or mobile, you’ll see a Chat icon next to the guest list — simply select this icon to create a group chat containing all event participants. Please note: this only applies to participants within your organization; external attendees are not included in the chat group. This makes it simple to chat with guests before, during, or after any meeting.

Chat with event attendees directly from the Calendar event on mobile devices

Chat with event attendees directly from the Calendar event on web

Who’s impacted 

End users 

Why you’d use it 

Previously, the main way to communicate with Calendar event attendees was via email. However, there are times when Chat may be preferred to email for communication. For example, sending a message that you’re running late, or sharing resources with attendees not long before the meeting starts. Now, the email and chat options are side by side on the calendar event. This can help you quickly choose whichever form of communication you prefer, and start conversations with just a few taps. When combined with Chat suggestions, it’s always easy to communicate with event participants via chat. 

Getting started 

Rollout pace 

On the web: 

On mobile: 


  • Available to all Google Workspace customers, as well as G Suite Basic and Business customers 


Building a sustainable future for travel

There’s a lot to consider when it comes to booking travel: price, health and safety, environmental impact and more. Last year, we shared travel tools to help you find health and safety information. Now we want to make it easier for you to find sustainable options while traveling — no matter what you’re doing or where you’re going.  

To make that happen, we’ve created a new team of engineers, designers and researchers focused solely on travel sustainability. Already, this team is working to highlight sustainable options within our travel tools that people use every day. 

Beginning this week, when you search for hotels on Google, you’ll see information about their sustainability efforts. Hotels that are certified for meeting high standards of sustainability from certain independent organizations, like Green Key or EarthCheck, will have an eco-certified badge next to their name. Want to dive into a hotel’s specific sustainability practices? Click on the “About” tab to see a list of what they’re doing — from waste reduction efforts and sustainably sourced materials to energy efficiency and water conservation measures.

Someone searches for a hotel in San Francisco and checks the hotel's sustainability attributes.

We’re working with hotels around the world, including independent hotels and chains such as Hilton and Accor, to gather this information and make it easily accessible. If you’re a hotel owner with eco-certifications or sustainability practices you want to share with travelers, simply sign in to Google My Business to add the attributes to your Business Profile, or contact Google My Business support.

Making travel more sustainable isn’t something we can do alone, which is why we’re also joining the global Travalyst coalition. As part of this group, we’ll help develop a standardized way to calculate carbon emissions for air travel. This free, open impact model will provide an industry framework to estimate emissions for a given flight and share that information with potential travelers. We’ll also contribute to the coalition’s sustainability standards for accommodations and work to align our new hotel features with these broader efforts.

All these updates are part of our commitment over the next decade to invest in technologies that help our partners and people around the world make sustainable choices. Look out for more updates in the months ahead as our travel sustainability team works with experts and partners to create a more sustainable future for all.

Helping travelers discover new things to do

While travel restrictions continue to vary across the globe, people are still dreaming of places to visit and things to do. Searches for “activities near me” have grown over the past 12 months, with specific queries like “ziplining” growing by 280% and “aquariums” by 115% globally. In response to this increasing interest, and to support the travel industry’s recovery, we’re introducing new ways to discover attractions, tours and activities on Search. 

Now, when people search on Google for attractions like the Tokyo Tower or the Statue of Liberty, they’ll see not just general information about the point of interest, but also booking links for basic admission and other ticket options where available. In the months ahead, we’ll also begin showing information and booking links for experiences in a destination, like wine tasting in Paris or bike tours in California. 

Ticketing options show the rates each partner offers for their tickets.

Select ‘Tickets’ to see ticketing options available from partner websites.

We’re working with a variety of partners, including online travel agencies and technology providers, to make this information available on Search. If you operate any attractions, tours or activities and want to participate, learn more in the Help Center.

Our goal is to help people find and compare all the best travel options, which is why partners can promote their ticket booking links at zero cost — similar to the free hotel booking links introduced earlier this year.

While it’s still early days, we’ve found that free hotel booking links result in increased engagement for both small and large partners. Hotels working with the booking engine WebHotelier saw more than $4.7M in additional revenue from free booking links this summer. With more than 6,000 active hotels, WebHotelier shared that they were "pleasantly surprised to receive reservations right from Google at no additional cost." This is one of the ways Google can support your business during recovery. 

Introducing a new ad format for things to do

We’re also introducing a new ad format for things to do that will help advertisers drive additional revenue and bookings as recovery continues. With more details like pricing, images and reviews, these new ads on Search will help partners stand out and expand their reach even further. Read more about how to get started in our Help Center.

This shows ads as the first search result and helps our paid partners get to the top of the page.

Ads to promote discovery of things to do and drive bookings.

It’s more important than ever to get the right insights, education and best practices you need as the travel landscape continues to evolve. In July, our team launched Travel Insights with Google in the U.S. to share Google’s travel demand insights with the world. And tomorrow — Thursday, September 23 — we’ll host a webinar to share tips and tricks for using Travel Insights with Google to help you better understand evolving travel demand. 

Across our new product updates and ongoing feature enhancements, we look forward to partnering closely on the travel recovery effort and preparing for the road ahead. 

Announcing WIT: A Wikipedia-Based Image-Text Dataset

Multimodal visio-linguistic models rely on rich datasets in order to model the relationship between images and text. Traditionally, these datasets have been created by either manually captioning images, or crawling the web and extracting the alt-text as the caption. While the former approach tends to result in higher quality data, the intensive manual annotation process limits the amount of data that can be created. On the other hand, the automated extraction approach can lead to bigger datasets, but these require either heuristics and careful filtering to ensure data quality or scaling-up models to achieve strong performance. An additional shortcoming of existing datasets is the dearth of coverage in non-English languages. This naturally led us to ask: Can one overcome these limitations and create a high-quality, large-sized, multilingual dataset with a variety of content?

Today we introduce the Wikipedia-Based Image Text (WIT) Dataset, a large multimodal dataset, created by extracting multiple different text selections associated with an image from Wikipedia articles and Wikimedia image links. This was accompanied by rigorous filtering to only retain high quality image-text sets. As detailed in “WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning”, presented at SIGIR ‘21, this resulted in a curated set of 37.5 million entity-rich image-text examples with 11.5 million unique images across 108 languages. The WIT dataset is available for download and use under the Creative Commons license. We are also excited to announce that we are hosting a competition with the WIT dataset on Kaggle in collaboration with Wikimedia Research and other external collaborators.

Dataset         Images    Text     Contextual Text    Languages
Flickr30K       32K       158K     -                  < 8
SBU Captions    1M        1M       -                  1
MS-COCO         330K      1.5M     -                  < 4; 7 (test only)
WIT             11.5M     37.5M    ~119M              108
WIT’s increased language coverage and larger size relative to previous datasets.

The unique advantages of the WIT dataset are:

  1. Size: WIT is the largest multimodal dataset of image-text examples that is publicly available.
  2. Multilingual: With 108 languages, WIT covers 10x or more languages than any other dataset.
  3. Contextual information: Unlike typical multimodal datasets, which have only one caption per image, WIT includes much page-level and section-level contextual information.
  4. Real world entities: Wikipedia, being a broad knowledge-base, is rich with real world entities that are represented in WIT.
  5. Challenging test set: In our recent work accepted at EMNLP, all state-of-the-art models demonstrated significantly lower performance on WIT vs. traditional evaluation sets (e.g., ~30 point drop in recall).

Generating the Dataset
The main goal of WIT was to create a large dataset without sacrificing on quality or coverage of concepts. Thus, we started by leveraging the largest online encyclopedia available today: Wikipedia.

For an example of the depth of information available, consider the Wikipedia page for Half Dome (Yosemite National Park, CA). As shown below, the article has numerous interesting text captions and relevant contextual information for the image, such as the page title, main page description, and other contextual information and metadata.

Example Wikipedia page with various image-associated text selections and contexts we can extract. From the Wikipedia page for Half Dome: Photo by DAVID ILIFF. License: CC BY-SA 3.0.
Example of the Wikipedia page for this specific image of Half Dome. From the Wikipedia page for Half Dome: Photo by DAVID ILIFF. License: CC BY-SA 3.0.

We started by selecting Wikipedia pages that have images, then extracted various image-text associations and surrounding contexts. To further refine the data, we performed a rigorous filtering process to ensure data quality. This included text-based filtering to ensure caption availability, length and quality (e.g., by removing generic default filler text); image-based filtering to ensure each image is a certain size with permissible licensing; and finally, image-and-text-entity–based filtering to ensure suitability for research (e.g., excluding those classified as hate speech). We further randomly sampled image-caption sets for evaluation by human editors, who overwhelmingly agreed that 98% of the samples had good image-caption alignment.
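To make the flavor of this rule-based filtering concrete, here is a minimal Python sketch. The thresholds, field names, and filler list are illustrative assumptions, not the exact WIT criteria:

```python
# Illustrative sketch of rule-based caption and image filtering.
# Thresholds, field names, and the filler list are assumptions,
# not the exact criteria used to build WIT.
GENERIC_FILLER = {"refer to caption", "see adjacent text", "image"}

def passes_text_filter(caption: str) -> bool:
    """Keep captions that exist, are long enough, and are not filler text."""
    if not caption:
        return False
    if len(caption.split()) < 3:  # too short to be informative
        return False
    if caption.strip().lower() in GENERIC_FILLER:
        return False
    return True

def passes_image_filter(width: int, height: int, license_tag: str) -> bool:
    """Keep sufficiently large images with a permissible license."""
    return min(width, height) >= 100 and license_tag.startswith("cc")

def keep_example(example: dict) -> bool:
    """Apply both filters; entity-based safety filtering would follow."""
    return (passes_text_filter(example.get("caption", ""))
            and passes_image_filter(example["width"], example["height"],
                                    example["license"]))

sample = {"caption": "Half Dome as viewed from the valley floor",
          "width": 1024, "height": 768, "license": "cc-by-sa-3.0"}
print(keep_example(sample))
```

In practice such filters run over millions of candidate image-text pairs, so each check is kept cheap and purely local to one example.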

Highly Multilingual
With data in 108 languages, WIT is the first large-scale, multilingual, multimodal dataset.

# of Image-Text Sets    # of Languages      # of Images    # of Languages
> 1M                    9                   > 1M           6
500K - 1M               10                  500K - 1M      12
100K - 500K             36                  100K - 500K    35
50K - 100K              15                  50K - 100K     17
14K - 50K               38                  13K - 50K      38
WIT: coverage statistics across languages.
Example of an image that is present in more than a dozen Wikipedia pages across >12 languages. From the Wikipedia page for Wolfgang Amadeus Mozart.

The First Contextual Image-Text Dataset
Most multimodal datasets only offer a single text caption (or multiple versions of a similar caption) for the given image. WIT is the first dataset to provide contextual information, which can help researchers model the effect of context on image captions as well as the choice of images.

WIT dataset example showing image-text data and additional contextual information.

In particular, key textual fields of WIT that may be useful for research include:

  • Text captions: WIT offers three different kinds of image captions: the (potentially context-influenced) “Reference description”, the (likely context-independent) “Attribution description”, and the “Alt-text description”.
  • Contextual information: This includes the page title, page description, URL and local context about the Wikipedia section including the section title and text.
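To illustrate, a single WIT example carrying these fields might look like the following Python dict. The field names mirror the released dataset as we understand it, and all values are invented for illustration:

```python
# One WIT example sketched as a Python dict. Field names follow the
# released TSV schema as we understand it; values are invented.
example = {
    "language": "en",
    "page_url": "https://en.wikipedia.org/wiki/Half_Dome",
    "image_url": "https://example.org/Half_Dome.jpg",
    # The three caption types:
    "caption_reference_description": "Half Dome as viewed from the valley",
    "caption_attribution_description": "Photo of Half Dome, Yosemite",
    "caption_alt_text_description": "Granite dome rising above a forest",
    # Contextual fields:
    "page_title": "Half Dome",
    "section_title": "Geology",
    "context_page_description": "Half Dome is a granite dome in Yosemite.",
    "context_section_description": "The dome was shaped by glaciation.",
}

# Not every field is populated for every example (e.g., alt-text covers
# only ~5.4M of the 37.6M rows), so downstream code should fall back
# gracefully across caption types.
caption = (example.get("caption_reference_description")
           or example.get("caption_attribution_description")
           or example.get("caption_alt_text_description"))
print(caption)
```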

WIT has broad coverage across these different fields, as shown below.

Image-Text Fields of WIT       Train    Val       Test      Total / Unique
Rows / Tuples                  37.1M    261.8K    210.7K    37.6M
Unique Images                  11.4M    58K       57K       11.5M
Reference Descriptions         16.9M    150K      104K      17.2M / 16.7M
Attribution Descriptions       34.8M    193K      200K      35.2M / 10.9M
Alt-Text                       5.3M     29K       29K       5.4M / 5.3M
Context Texts                  -        -         -         119.8M
Key fields of WIT include both text captions and contextual information.

A High-Quality Training Set and a Challenging Evaluation Benchmark
The broad coverage of diverse concepts in Wikipedia means that the WIT evaluation sets serve as a challenging benchmark, even for state-of-the-art models. We found that for image-text retrieval, mean recall scores for traditional datasets were in the 80s, whereas for the WIT test set they were in the 40s for well-resourced languages and in the 30s for under-resourced languages. We hope this in turn can help researchers to build stronger, more robust models.
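The scores quoted above are mean recall: recall@K averaged over queries and then over several values of K. A minimal sketch with made-up rankings, assuming retrieval returns a ranked list of caption IDs per image:

```python
# Mean recall for image-to-text retrieval: recall@K averaged over
# queries, then averaged over several K. Rankings below are made up.
def recall_at_k(ranked_caption_ids, true_caption_id, k):
    """1.0 if the correct caption appears in the top-k results, else 0.0."""
    return 1.0 if true_caption_id in ranked_caption_ids[:k] else 0.0

def mean_recall(rankings, truths, ks=(1, 5, 10)):
    """Average recall@K over all queries, then over the chosen K values."""
    per_k = [sum(recall_at_k(r, t, k) for r, t in zip(rankings, truths))
             / len(truths)
             for k in ks]
    return sum(per_k) / len(per_k)

# Two toy queries: the first finds its caption at rank 1, the second at rank 7.
rankings = [["c3", "c1", "c2"],
            ["c9", "c8", "c7", "c6", "c5", "c4", "c0"]]
truths = ["c3", "c0"]
print(mean_recall(rankings, truths))
```

With these toy rankings, recall@1 and recall@5 are each 0.5 and recall@10 is 1.0, so the mean is 2/3; real evaluations average over the full test set.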

WIT Dataset and Competition with Wikimedia and Kaggle
Additionally, we are happy to announce that we are partnering with Wikimedia Research and a few external collaborators to organize a competition with the WIT test set, hosted on Kaggle. The competition is an image-text retrieval task: given a set of images and text captions, the task is to retrieve the appropriate caption(s) for each image.

To enable research in this area, Wikipedia has kindly made available images at 300-pixel resolution and ResNet-50-based image embeddings for most of the training and test datasets. Kaggle will host all of this image data in addition to the WIT dataset itself and will provide Colab notebooks. Further, competitors will have access to a discussion forum on Kaggle to share code and collaborate, enabling anyone interested in multimodality to get started and run experiments easily. We are excited and looking forward to what will result from the WIT dataset and the Wikipedia images on the Kaggle platform.
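One way the provided embeddings could be used is to score captions against images by cosine similarity. In this sketch, random vectors stand in for the actual ResNet-50 image features and for a hypothetical text encoder's caption embeddings:

```python
import numpy as np

# Sketch of caption retrieval from precomputed embeddings. The random
# vectors below are stand-ins for the provided ResNet-50 image features
# and for caption embeddings from a hypothetical text encoder.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(4, 2048))                        # 4 images
caption_embs = image_embs + 0.01 * rng.normal(size=(4, 2048))  # matched captions

def retrieve(image_embs, caption_embs, k=2):
    """Return indices of the top-k captions per image by cosine similarity."""
    a = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    b = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = a @ b.T                       # (num_images, num_captions)
    return np.argsort(-sims, axis=1)[:, :k]

top = retrieve(image_embs, caption_embs)
print(top[:, 0])  # with near-identical embeddings, each image matches its own caption
```

A real entry would replace the caption side with a learned multilingual text encoder; the scoring and ranking step stays essentially the same.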

We believe that the WIT dataset will aid researchers in building better multimodal multilingual models and in identifying better learning and representation techniques, ultimately leading to improved Machine Learning models in real-world tasks over visio-linguistic data. For any questions, please contact [email protected]. We would love to hear about how you are using the WIT dataset.

We would like to thank our co-authors in Google Research: Jiecao Chen, Michael Bendersky and Marc Najork. We thank Beer Changpinyo, Corinna Cortes, Joshua Gang, Chao Jia, Ashwin Kakarla, Mike Lee, Zhen Li, Piyush Sharma, Radu Soricut, Ashish Vaswani, Yinfei Yang, and our reviewers for their insightful feedback and comments.

We thank Miriam Redi and Leila Zia from Wikimedia Research for collaborating with us on the competition and providing image pixels and image embedding data. We thank Addison Howard and Walter Reade for helping us host this competition on Kaggle. We also thank Diane Larlus (Naver Labs Europe (NLE)), Yannis Kalantidis (NLE), Stéphane Clinchant (NLE), Tiziano Piccardi (Ph.D. student at EPFL), Lucie-Aimée Kaffee (Ph.D. student at the University of Southampton) and Yacine Jernite (Hugging Face) for their valuable contributions to the competition.

Source: Google AI Blog

Chrome for Android Update

Hi, everyone! We've just released Chrome 94 (94.0.4606.50) for Android: it'll become available on Google Play over the next few days.

This release includes stability and performance improvements. You can see a full list of the changes in the Git log. If you find a new issue, please let us know by filing a bug.

Krishna Govind
Google Chrome

Perform refined email searches with new rich filters in Gmail on web

Quick launch summary 

When searching in Gmail on web, enhanced search chips will provide richer drop-down lists with more options that help you apply additional filters. For example, when you click on the “From” chip, you’ll now be able to quickly type a name, choose from a list of suggested senders, or search for emails from multiple senders. Available now for all users, search chips make it quicker and easier to find the specific email or information you’re looking for. 

A richer drop-down list in search

Getting started 

  • Admins: There is no admin control for this feature. 
  • End users: There is no end user setting for this feature; chips will appear automatically when you perform a search in Gmail on the web. Use the Help Center to learn more about search in Gmail. 

Rollout pace 

  • This feature is available now for all users. 


Availability 

  • Available to all Google Workspace customers, as well as G Suite Basic and Business customers. Also available to users with personal Google Accounts. 