How Grant builds inclusivity in and outside of work

Welcome to the latest edition of “My Path to Google,” where we talk to Googlers, interns and alumni about how they got to Google, what they do in their roles and how they prepared for their interviews.

Today’s post is all about Grant Bennett, a Human Resources Associate working remotely from North Carolina, and his passion for driving equity and inclusivity both in and outside of Google.

What do you do at Google?

I’m in Google’s Human Resources Associate program, a two-year rotational program for recent college graduates. Now in my second and final rotation, I work as an Operations and Analytics Specialist on the Retention and Progression team. I help analyze and share insights to improve Googlers’ experiences.

What’s your typical workday like?

I’ve been working remotely from North Carolina since I started at Google in 2020. My day usually begins with a morning workout and some dedicated reading time. Once I log in to work, I check emails, create my to-do list and take data science skills training. The rest of my day is spent jumping in and out of meetings with teammates and consultants, working through data and generating reports for my team.

Can you tell us a bit more about your background?

I grew up in Fayetteville, North Carolina. My father served in the military and my mother is an educator at our local community college. Baseball was my favorite activity as a kid. One time when I was practicing in downtown Fayetteville, a director asked me to make a cameo in a music video for the rapper J.Cole (which I eagerly accepted). I met J.Cole again years later, and we talked about the importance of branching out and having new experiences. That conversation inspired me to attend a Historically Black College or University (HBCU). I enrolled at Morehouse College, the only all-male HBCU in the United States, on a baseball and academic scholarship. I studied psychology and got really involved in campus life. Through these experiences, I found my passion for social impact and research.

What’s your daily source of inspiration?

I’m inspired that Google continues to work towards creating a more equitable and inclusive workplace, and I’m excited to take on projects connected to our HBCU commitments. I’m passionate about this work, because I understand the value of investing in communities that have been historically under-resourced and excluded.

Are you working on any projects outside of work?

I’m the Founder and Executive Director of The Two-Six Project, a nonprofit organization helping to develop leaders from marginalized communities. We provide funding, leadership development training and scholarships to youth athletic organizations in the Fayetteville area. Thanks in part to the generous support of individual Googlers during our holiday giving campaign, The Two-Six Project recently hosted its second annual “Christmas Giveback” event and provided food, toys and winter clothing to over 2,000 people. The success of this event led to a feature in Forbes Magazine and my participation on a panel about equity, moderated by Former U.S. President Bill Clinton.

Grant at Google’s Mountain View headquarters.

How did you prepare for your Google interviews?

I really studied my resumé to help me tell my career story and quantify my impact. I also researched behavioral-based questions — “tell me about a time you…” — and asked close friends to conduct mock interviews.

What advice would you give to your past self?

I would remind myself that my perspective is valuable. Coming from an HBCU, you may feel a sense of imposter syndrome or self-doubt when going through the hiring process. But it’s important to remember that your unique experience helps you impact the world in your own way. I would tell myself to trust the path that got me here, and to focus on showing why I would be a good fit for the role.

Any tips for aspiring Googlers?

No matter what, be authentic. Google is a melting pot of diverse people, so know that you will add just as much value to the company as it will add to your professional growth. Don’t be afraid to ask questions, be intentional with your energy and build healthy habits around networking.

Material You: Coming to more Android devices near you

Posted by Rohan Shah, Product Manager on Android

We’re excited to announce that Material You, specifically dynamic color, will soon be available on more Android 12 phones globally, including devices by Samsung, OnePlus, Oppo, Vivo, realme, Xiaomi, Tecno, and more!

With the release of Android 12 and the introduction of Material You, we made the Android experience more fluid and personal than ever for our users. The gorgeous new design brought to life experiences such as a more dynamic touch ripple, a silky-smooth scroll, and a spacious layout. But the star of the show was, and continues to be, dynamic color – pick your favorite wallpaper and the entire phone experience transforms to better express you, from your home screen to some of your favorite apps.

With Material You, personalization is now a defining trait of Android that our ecosystem will continue building on for years to come. We want to make sure that you, our developers, have the confidence to join us on the journey and bring a more personal look and feel to users through your apps.


A Gmail rainbow with different wallpaper-based themes, shown on some of the Android device experiences that will support Material You


As more Android 12 devices land in the next couple of months, our OEM partners are working with us to ensure that key design APIs, especially around dynamic color, work consistently across the Android ecosystem so developers can have peace of mind and users can benefit from a cohesive experience.

To better help you understand how to implement dynamic color and fit it into your overall brand story, the Material team has published the comprehensive Customizing Material article with codelabs and guides to get started with Views or Jetpack Compose. Watch for ongoing updates to Material Theme Builder and Material Color Utilities in the coming months to provide you with the tools you need for design and implementation.


Visualize dynamic color in your app with the Material Theme Builder


Google apps (Gmail, Photos, Chrome, and many more) have used the very same tools and guidance to bring the color story to life on their branded experiences, and we’re excited for you to hop on board as well. As you learn more about how color can harmonize with user choice and work with dynamic color in your app, we’d love to get your feedback via the Material Android issue tracker. Happy coloring!

Making sure everyone feels “Seen on Pixel”

The Super Bowl has always been a special moment for Google. From our first Super Bowl ad in 2010, “Parisian Love,” to our 2020 spot “Loretta,” we try to shine a light on the challenges we’re focused on solving with our technology and tell the stories of real people impacted by our products.

And today, we’re continuing this legacy with our latest Super Bowl ad, “Seen on Pixel,” which tells the story of Real Tone, Google’s years-long effort to ensure all our camera and imaging products accurately represent all skin tones.

For too long, camera technology, including our own, has failed people of color by making them look washed out, or unnaturally bright or dark. Because everyone deserves to be seen as they truly are, we are committed to addressing this gap. Internally, Googlers of color volunteered to test the camera on Pixel 6 before we launched it and provided input on what was working and what could be better. Externally, we partnered with image experts who spent months with our engineers, testing the camera and providing detailed and thoughtful feedback that helped improve our camera and editing products, including adding significantly more portraits of people of color in the image datasets that train our camera models. This collective teamwork allowed us to launch what we call Real Tone, with Pixel 6 as our first camera to feature these improvements.

Since the launch of Real Tone on Google Pixel 6 and Pixel 6 Pro last October, we have seen the difference camera representation can make. “Seen on Pixel” brings to life what Real Tone represents. It is a montage of beautiful photography of individuals and families from all walks of life, all photographed on Pixel 6 by our director Joshua Kissi and contributing photographers Deun Ivory and Aundre Larrow. We partnered with award-winning artist Lizzo, who truly embodies the spirit of our campaign by always being her authentic self, unapologetically. Her powerful vocals as the soundtrack bring “Seen on Pixel” to life with a preview of her new song, “If You Love Me.”

Representation and equity in everything should always be the norm and the default. And until we reach it, our goal at Google will always be to make gains in the world every day through our products and storytelling.

Chrome Beta for Android Update

Hi everyone! We've just released Chrome Beta 99 (99.0.4844.27) for Android: it's now available on Google Play.

You can see a partial list of the changes in the Git log. For details on new features, check out the Chromium blog, and for details on web platform updates, check here.

If you find a new issue, please let us know by filing a bug.

Ben Mason
Google Chrome

Google Ads API v10 RMF Update

Effective with Google Ads API version 10, we updated the Required Minimum Functionality (RMF) for the Google Ads API. This reflects the evolution of the Google Ads platform, including the upgrade of Smart Shopping and Local Campaigns to Performance Max. We also published the requirements for Standard Shopping, Hotel-only and App Promotion-only tools.

There is a new product-specific RMF for Smart Campaigns. This is not a requirement for all tools, only those that implement Smart Campaigns. If you use Smart Campaigns, this defines the minimum set of features that are required.

The minimum set of features for Performance Max campaigns is now available.

These changes will affect the following tools:
  • Shopping-only, Smart Shopping-only API tools
  • Special purpose tools that offer campaign creation or management functionality
These changes will not affect full-service, Hotel-only, App Promotion-only or reporting-only tools.

For precise details, see the updated Google Ads API Required Minimum Functionality.

Requirements for the AdWords API remain unchanged.

Standard Shopping Campaigns
With the release of Google Ads API v9, we simplified RMF requirements for Full-service tools. We are now making the same changes to Standard Shopping Campaigns.

The following features are still required but simplified or reduced in scope:
  • C.190 (Create ad group): The ability to create multiple ad groups is now optional.
  • C.525 (Add first (root) product partition): This is a required step for creating a shopping campaign and it is done automatically; it is not separately invoked by the merchant. A new campaign should have a root partition; otherwise it will not serve. A hedged code sketch of this step appears after this list.
  • M.10 (Edit campaign settings): Only settings required at creation time would be required at change time (e.g. NetworkSettings would not be required to edit, since C.50 is no longer a requirement).
  • R.10 (Customer): Optional if only implementing one campaign.
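
As a rough, hedged illustration of the C.525 step, the sketch below adds the root listing group (product partition) to an existing Shopping ad group using the google-ads Python client library. It is an outline under stated assumptions, not official guidance; verify resource paths, field names and enum values against the client library's published Shopping examples.

```python
# Hedged sketch only: adding the root product partition (listing group) to a
# Shopping ad group with the google-ads Python client library. Verify field
# and enum names against the official examples before use.
from google.ads.googleads.client import GoogleAdsClient

def add_root_product_partition(client, customer_id, ad_group_id):
    ad_group_service = client.get_service("AdGroupService")
    criterion_service = client.get_service("AdGroupCriterionService")

    operation = client.get_type("AdGroupCriterionOperation")
    criterion = operation.create
    criterion.ad_group = ad_group_service.ad_group_path(customer_id, ad_group_id)
    criterion.status = client.enums.AdGroupCriterionStatusEnum.ENABLED
    # The root partition here is a single UNIT listing group covering all products.
    criterion.listing_group.type_ = client.enums.ListingGroupTypeEnum.UNIT
    criterion.cpc_bid_micros = 500_000  # illustrative bid

    response = criterion_service.mutate_ad_group_criteria(
        customer_id=customer_id, operations=[operation]
    )
    return response.results[0].resource_name

# Example usage (placeholder IDs):
# client = GoogleAdsClient.load_from_storage("google-ads.yaml")
# print(add_root_product_partition(client, "1234567890", "9876543210"))
```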

The following features are no longer required. Developers may continue to use these features (unless already sunset), but they are no longer required in order to maintain compliance with the Terms & Conditions of using the Google Ads API. All these features, unless already sunset, are considered optional.

  • C.14: Set mobile platform bid adjustment
  • C.15: Set tablet and desktop platform bid adjustments
  • C.21: Enable distance targeting
  • C.25: Set geo bid adjustment
  • C.50: Opt in/out of networks
  • C.90: Set bidding option: Manual CPC
  • C.95: Set bidding option: Enhanced CPC
  • C.101: Set bidding option: Maximize clicks (Portfolio)
  • C.140: Set delivery method
  • C.191: Set ad group max CPC bid
  • C.192: Set ad group max CPA
  • C.193: Set ad group target ROAS
  • C.320: Account-level tracking template
  • C.321: Campaign-level tracking template
  • C.325: Campaign-level custom parameters
  • C.326: Ad group-level custom parameters
  • C.328: Account-level final URL suffix
  • C.329: Campaign-level final URL suffix
  • C.700: Create ad group/campaign criterion that targets/excludes user list
  • C.710: Set userlist targeting bid adjustment for search network campaigns and ad groups
  • M.15: Edit mobile, tablet, and desktop platform bid adjustments
  • M.20: Edit ad group settings (all ad group-related required settings in Creation Functionality)
  • M.25: Edit geo bid adjustment
  • M.101: Edit bidding option: Maximize clicks (Standard)
  • M.120: Pause / enable / remove ad group
  • M.180: Edit product partition max CPC*
  • M.320: Manage all tracking templates in creation functionality
  • M.325: Manage all custom parameters in creation functionality
  • M.328: Manage all final URL suffixes in creation functionality
  • M.700: Edit ad group/campaign criterion that targets/excludes user list
  • M.710: Edit userlist targeting bid adjustment for search network campaigns and ad groups
  • R.30: Ad Group
  • R.80: Geographic View
  • R.150: Campaign Audience View
  • Ad Group Audience View

For more information
If you have questions specific to RMF, please contact the Google Ads API Compliance team at https://services.google.com/fb/forms/apicontact/.

If you have any questions or need additional help with the API, contact us via the forum.

Announcing v10 of the Google Ads API

Today we’re announcing the v10 release of the Google Ads API. To use some of the v10 features, you’ll need to upgrade your client libraries and client code. The updated client libraries and code examples will be published next week.

Note: Developers must migrate from the legacy AdWords API to the Google Ads API by April 27, 2022.

Here are the highlights of v10:
Where can I learn more?
The following resources can help you get started:

If you have any questions or need additional help, contact us via the forum.

Nested Hierarchical Transformer: Towards Accurate, Data-Efficient, and Interpretable Visual Understanding

In visual understanding, the Vision Transformer (ViT) and its variants have received significant attention recently due to their superior performance on many core visual applications, such as image classification, object detection, and video understanding. The core idea of ViT is to utilize the power of self-attention layers to learn global relationships between small patches of images. However, the number of connections between patches grows quadratically with image size. Such a design has been observed to be data inefficient: although the original ViT can perform better than convolutional networks when pre-trained on hundreds of millions of images, such a data requirement is not always practical, and it still underperforms convolutional networks when given less data. Many researchers are exploring more suitable architectural re-designs that can learn visual representations effectively, such as by adding convolutional layers and building hierarchical structures with local self-attention.
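
To make the quadratic growth concrete, the short snippet below (illustrative only, assuming 16x16-pixel patches) counts the patch-to-patch attention connections for a few common image resolutions.

```python
# Illustrative only: count patch-to-patch self-attention connections in a
# plain ViT, assuming 16x16-pixel patches. Doubling the image side roughly
# quadruples the number of patches and increases connections ~16x.
PATCH_SIZE = 16
for side in (224, 384, 448):
    num_patches = (side // PATCH_SIZE) ** 2
    connections = num_patches ** 2          # every patch attends to every patch
    print(f"{side}x{side} image: {num_patches} patches, {connections:,} connections")
```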

The principle of hierarchical structure is one of the core ideas in vision models, where bottom layers learn more local object structures in the high-dimensional pixel space and top layers learn more abstract, high-level knowledge in a low-dimensional feature space. Existing ViT-based methods focus on designing a variety of modifications inside self-attention layers to achieve such a hierarchy, but while these offer promising performance improvements, they often require substantial architectural re-designs. Moreover, these approaches lack an interpretable design, so it is difficult to explain the inner workings of trained models.

To address these challenges, in “Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding”, we present a rethinking of existing hierarchical structure–driven designs, and provide a novel and orthogonal approach to significantly simplify them. The central idea of this work is to decouple the feature learning and feature abstraction (pooling) components: nested transformer layers encode visual knowledge of image patches separately, and then the processed information is aggregated. This process is repeated in a hierarchical manner, resulting in a pyramid network structure. The resulting architecture achieves competitive results on ImageNet and outperforms existing methods on data-efficient benchmarks. We have shown that such a design can meaningfully improve data efficiency with faster convergence and provide valuable interpretability benefits. Moreover, we introduce GradCAT, a new technique for interpreting the decision process of a trained model at inference time.

Architecture Design
The overall architecture is simple to implement by adding just a few lines of Python code to the source code of the original ViT. The original ViT architecture divides an input image into small patches, projects pixels of each patch to a vector with predefined dimension, and then feeds the sequences of all vectors to the overall ViT architecture containing multiple stacked identical transformer layers. While every layer in ViT processes information of the whole image, with this new method, stacked transformer layers are used to process only a region (i.e., block) of the image containing a few spatially adjacent image patches. This step is independent for each block and is also where feature learning occurs. Finally, a new computational layer called block aggregation then combines all of the spatially adjacent blocks. After each block aggregation, the features corresponding to four spatially adjacent blocks are fed to another module with a stack of transformer layers, which then process those four blocks jointly. This design naturally builds a pyramid hierarchical structure of the network, where bottom layers can focus on local features (such as textures) and upper layers focus on global features (such as object shape) at reduced dimensionality thanks to the block aggregation.

A visualization of the network processing an image: Given an input image, the network first partitions it into blocks, where each block contains 4 image patches. Image patches in every block are linearly projected as vectors and processed by a stack of identical transformer layers. Then the proposed block aggregation layer aggregates information from each block and reduces the spatial size by 4 times. The number of blocks is reduced to 1 at the top of the hierarchy, and classification is conducted on its output.
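
To make the data flow concrete, here is a minimal PyTorch sketch of the nested hierarchy described above. It is an illustrative interpretation, not the released implementation: the patch embedding, the exact block-aggregation operator (here a 3x3 convolution followed by max-pooling) and all hyperparameters are assumptions.

```python
# A minimal PyTorch sketch of the nested hierarchy (not the released code).
# Layer sizes and the exact aggregation details are illustrative assumptions.
import torch
import torch.nn as nn

def blockify(x, block_size):
    """(B, H, W, D) -> (B, num_blocks, block_size**2, D)."""
    B, H, W, D = x.shape
    gh, gw = H // block_size, W // block_size
    x = x.reshape(B, gh, block_size, gw, block_size, D)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, gh * gw, block_size**2, D)

def unblockify(x, block_size):
    """Inverse of blockify: (B, num_blocks, block_size**2, D) -> (B, H, W, D)."""
    B, nblk, _, D = x.shape
    g = int(round(nblk ** 0.5))
    x = x.reshape(B, g, g, block_size, block_size, D)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, g * block_size, g * block_size, D)

class BlockAggregation(nn.Module):
    """Conv + max-pool on the full spatial map, halving height and width so
    the number of blocks drops 4x while patches per block stay constant."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x, block_size):
        x = unblockify(x, block_size).permute(0, 3, 1, 2)   # (B, D, H, W)
        x = self.pool(self.conv(x)).permute(0, 2, 3, 1)     # (B, H/2, W/2, D)
        return blockify(x, block_size)

class NestedTransformerSketch(nn.Module):
    def __init__(self, patch_size=2, block_size=4, dim=96,
                 depth=2, num_levels=3, num_classes=10):
        super().__init__()
        self.block_size = block_size
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        layer = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=4 * dim, batch_first=True)
        self.levels = nn.ModuleList(
            nn.TransformerEncoder(layer(), num_layers=depth) for _ in range(num_levels))
        self.aggregate = nn.ModuleList(
            BlockAggregation(dim) for _ in range(num_levels - 1))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, img):                                  # img: (B, 3, H, W)
        x = self.patch_embed(img).permute(0, 2, 3, 1)        # (B, H/p, W/p, D)
        x = blockify(x, self.block_size)                     # (B, num_blocks, P, D)
        for i, encoder in enumerate(self.levels):
            B, nblk, P, D = x.shape
            # Each block is processed independently; this is where feature
            # learning happens at every level of the hierarchy.
            x = encoder(x.reshape(B * nblk, P, D)).reshape(B, nblk, P, D)
            if i < len(self.aggregate):
                x = self.aggregate[i](x, self.block_size)    # 4x fewer blocks
        return self.head(x.mean(dim=(1, 2)))                 # single block at the top

logits = NestedTransformerSketch()(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```

Note how each transformer encoder sees only the patches of a single block, and only the aggregation step mixes information across blocks.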

Interpretability
This architecture has a non-overlapping information processing mechanism, independent at every node. The design resembles a decision tree-like structure, which offers unique interpretability capabilities because every tree node contains the independent information of an image block that is then passed to its parent node. We can trace the information flow through the nodes to understand the importance of each feature. In addition, our hierarchical structure retains the spatial structure of images throughout the network, leading to learned spatial feature maps that are effective for interpretation. Below we showcase two kinds of visual interpretability.

First, we present a method to interpret the trained model on test images, called gradient-based class-aware tree-traversal (GradCAT). GradCAT traces the feature importance of each block (a tree node) from top to bottom of the hierarchy structure. The main idea is to find the most valuable traversal from the root node at the top layer to a child node at the bottom layer that contributes the most to the classification outcomes. Since each node processes information from a certain region of the image, such traversal can be easily mapped to the image space for interpretation (as shown by the overlaid dots and lines in the image below).

The following is an example of the model's top-4 predictions and corresponding interpretability results on the left input image (containing 4 animals). As shown below, GradCAT highlights the decision path along the hierarchical structure as well as the corresponding visual cues in local image regions on the images.

Given the left input image (containing four objects), the figure visualizes the interpretability results of the top-4 prediction classes. The traversal locates the model decision path along the tree and simultaneously locates the corresponding image patch (shown by the dotted line on images) that has the highest impact to the predicted target class.
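
To make the traversal idea concrete, here is a much-simplified, hypothetical sketch: the scalar `value` on each node stands in for the class-aware, gradient-derived score that GradCAT computes, and the greedy descent picks the highest-valued child at every level. It is not the authors' implementation.

```python
# Simplified, hypothetical sketch of the GradCAT traversal idea.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    value: float                           # class-aware importance score of this block
    region: Tuple[int, int, int, int]      # (y0, x0, y1, x1) image region covered by the block
    children: List["Node"] = field(default_factory=list)

def gradcat_traversal(root: Node) -> List[Node]:
    """Greedy top-down traversal: starting at the root, repeatedly descend into
    the child block with the highest importance value, recording the path."""
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda child: child.value)
        path.append(node)
    return path

# Toy example: a 2-level tree over a 32x32 image; the traversal ends at the
# bottom-right block because it carries the highest importance value.
leaves = [Node(0.1, (0, 0, 16, 16)), Node(0.2, (0, 16, 16, 32)),
          Node(0.3, (16, 0, 32, 16)), Node(0.9, (16, 16, 32, 32))]
root = Node(1.0, (0, 0, 32, 32), children=leaves)
print([n.region for n in gradcat_traversal(root)])   # [(0, 0, 32, 32), (16, 16, 32, 32)]
```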

Moreover, the following figures visualize results on the ImageNet validation set and show how this approach enables some intuitive observations. For instance, the example of the lighter below (upper left panel) is particularly interesting because the ground truth class — lighter/matchstick — actually defines the bottom-right matchstick object, while the most salient visual features (with the highest node values) are actually from the upper-left red light, which conceptually shares visual cues with a lighter. This can also be seen from the overlaid red lines, which indicate the image patches with the highest impact on the prediction. Thus, although the visual cue is a mistake, the output prediction is correct. In addition, the four child nodes of the wooden spoon below have similar feature importance values (see numbers visualized in the nodes; higher indicates more importance), which is because the wooden texture of the table is similar to that of the spoon.

Visualization of the results obtained by the proposed GradCAT. Images are from the ImageNet validation dataset.

Second, different from the original ViT, our hierarchical architecture retains spatial relationships in learned representations. The top layers output low-resolution feature maps of input images, enabling the model to easily perform attention-based interpretation by applying Class Attention Map (CAM) on the learned representations at the top hierarchical level. This enables high-quality weakly-supervised object localization with just image-level labels. See the following figure for examples.

Visualization of CAM-based attention results. Warmer colors indicate higher attention. Images are from the ImageNet validation dataset.
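
As a rough illustration of this step, the NumPy sketch below computes a class attention map from a top-level feature map, assuming a linear classifier with weights `W` on top of globally pooled features; the shapes and the final upsampling are illustrative assumptions, not the exact pipeline.

```python
# Minimal CAM-style sketch, assuming a linear head W of shape (num_classes, D).
import numpy as np

def class_attention_map(feature_map: np.ndarray, W: np.ndarray, cls: int) -> np.ndarray:
    """feature_map: (H, W, D) spatial features from the top hierarchy level.
    W: (num_classes, D) weights of a linear classifier on pooled features.
    Returns an (H, W) map: each location's feature vector weighted by the
    target class's classifier weights, ReLU'd and min-max normalized."""
    cam = feature_map @ W[cls]                       # (H, W)
    cam = np.maximum(cam, 0.0)                       # keep positive evidence only
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

# Toy usage: a 4x4, 96-channel top-level feature map and a 10-class head.
fmap = np.random.rand(4, 4, 96).astype(np.float32)
W = np.random.rand(10, 96).astype(np.float32)
heatmap = class_attention_map(fmap, W, cls=3)        # upsample to image size for display
```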

Convergence Advantages
With this design, feature learning happens only in local regions, independently, and feature abstraction happens inside the aggregation function. This design, with its simple implementation, is general enough for other types of visual understanding tasks beyond classification. It also greatly improves model convergence speed, significantly reducing the training time needed to reach the desired maximum accuracy.

We validate this advantage in two ways. First, we compare against the original ViT architecture on ImageNet accuracy across different numbers of total training epochs. The results are shown on the left side of the figure below, demonstrating much faster convergence than the original ViT, e.g., around a 20% improvement in accuracy over ViT at 30 total training epochs.

Second, we modify the architecture to conduct unconditional image generation tasks, since training ViT-based models for image generation tasks is challenging due to convergence and speed issues. Creating such a generator is straightforward by transposing the proposed architecture: the input is an embedding vector, the output is a full image in RGB channels, and the block aggregation is replaced by a block de-aggregation component supported by Pixel Shuffling. Surprisingly, we find our generator is easy to train and demonstrates faster convergence speed, as well as better FID score (which measures how similar generated images are to real ones), than the capacity-comparable SAGAN.
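
To illustrate the block de-aggregation idea, here is a small, hedged sketch built on PyTorch's PixelShuffle; the surrounding generator, channel counts and kernel sizes are assumptions rather than the paper's exact design.

```python
# Hedged sketch of a de-aggregation step for the transposed (generator) variant.
import torch
import torch.nn as nn

class BlockDeAggregation(nn.Module):
    """Expand channels 4x with a conv, then pixel-shuffle so height and width
    double; the number of blocks between transformer levels grows 4x."""
    def __init__(self, dim):
        super().__init__()
        self.expand = nn.Conv2d(dim, 4 * dim, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(upscale_factor=2)   # (B, 4D, H, W) -> (B, D, 2H, 2W)

    def forward(self, x):                  # x: (B, D, H, W) feature map
        return self.shuffle(self.expand(x))

x = torch.randn(2, 96, 8, 8)
print(BlockDeAggregation(96)(x).shape)     # torch.Size([2, 96, 16, 16])
```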

Left: ImageNet accuracy given different numbers of total training epochs, compared with the standard ViT architecture. Right: ImageNet 64x64 image generation FID scores (lower is better) with a single 1000-epoch training run. On both tasks, our method shows better convergence speed.

Conclusion
In this work we demonstrate the simple idea that decoupling feature learning and feature abstraction in this nested hierarchy design leads to better feature interpretability, through a new gradient-based class-aware tree-traversal method. Moreover, the architecture improves convergence not only on classification tasks but also on image generation tasks. The proposed idea focuses on the aggregation function and is thereby orthogonal to advanced architecture designs for self-attention. We hope this new research encourages future architecture designers to explore more interpretable and data-efficient ViT-based models for visual understanding, such as adapting this work for high-resolution image generation. We have also released the source code for the image classification portion of this work.

Acknowledgements
We gratefully acknowledge the contributions of other co-authors, including Han Zhang, Long Zhao, Ting Chen, Sercan Arik, and Tomas Pfister. We also thank Xiaohua Zhai, Jeremy Kubica, Kihyuk Sohn, and Madeleine Udell for their valuable feedback on this work.

Source: Google AI Blog


Beta Channel Update for Desktop

The Beta channel has been updated to 99.0.4844.27 for Mac, Windows and Linux.

A full list of changes in this build is available in the log. Interested in switching release channels? Find out how here. If you find a new issue, please let us know by filing a bug. The community help forum is also a great place to reach out for help or learn about common issues.


Prudhvikumar Bommana
Google Chrome