Tag Archives: accessibility

Ikumi Kobayashi on taking inclusion seriously

Welcome to the latest edition of “My Path to Google,” where we talk to Googlers, interns and alumni about how they got to Google, what their roles are like and even some tips on how to prepare for interviews.


Today’s post is all about Ikumi Kobayashi, a Search Optimization Specialist based out of Tokyo whose search for an inclusive and accessible workplace ultimately led her to her role at Google and a newfound confidence.


Can you tell us about your decision to apply to Google? 

I have profound hearing loss in both ears and use hearing aids. I rely on lip-reading during conversations. As a person with a disability (PwD), I struggled during my job hunt in Japan because most of the companies I applied to had limited job postings for PwD, and the benefits for PwD were often unequal compared to people without a disability. 


I decided to apply to Google because I wanted to work in a company that takes diversity and inclusion seriously. I was nervous before applying to Google because teamwork can be difficult for a hard-of-hearing person like me, but I decided to give it a try because I had nothing to lose.


How would you describe your path to your current role at Google? 

I studied communications in undergrad and joined Google right out of grad school, so Google is the first company I’ve worked at. I was an intern my first year at Google, and during that time my team supported me to overcome anxiety and build confidence as a Googler with a hearing disability. 


I started as a Google Ads Account Manager, but I found face-to-face conversations with many clients every day difficult, and I preferred working more with the product and with my teammates. After three months, I moved to my current team. My job title is now Search Optimization Specialist, and my responsibility is to support Japanese companies in the entertainment industry as they run and optimize their Google Search Ads. It is very rewarding to see the companies I support grow, and I am really thankful to my previous and current teams for accommodating me so flexibly.

Ten people gathered around a table inside of a restaurant.

Ikumi and teammates out at dinner in 2019.

What does your typical day look like right now? 

Since our Google Tokyo office shut down in March 2020, I have been working remotely from my apartment in Tokyo. I really miss meeting my teammates and friends in the office, but I keep myself energized by proactively setting up meetings as much as possible. Conversations with Googlers always help me to maximize my productivity. Outside of work, I'm a fashion enthusiast and go to a fashion design school three times a week after work. I love to watch fashion shows on YouTube during my free time.


What inspires you to come in (or log on) every day?

I am passionate about advocating for diversity, inclusion and accessibility so I joined the Disability Alliance — an employee resource group for Googlers. Right now, I am the only Japanese hard-of-hearing Googler on the Google Ads team and we can do more to diversify the Asia-Pacific Google community. I strive to do my best to make our community even more accessible for Googlers with disabilities.

Ikumi speaking into a microphone in front of a large group. A slide is projected behind her introducing herself.

What's one thing you wish you could go back and tell yourself before applying? 

I would love to tell my past self (and anyone else with a disability who is considering applying to Google) that Google will not let you down because of your disability. I was once a very unconfident person because I was always left behind during conversations and felt helpless. Google’s mission statement is to make the world's information universally accessible and useful, and that applies to the workplace as well. 


Can you tell us about the resources you used to prepare for your interview or role? 

Before applying to Google as a grad student, I had little work experience so I spent lots of time revisiting my past challenges and thinking through how I tried to overcome them. Leadership doesn't only mean leading a group. If you have an experience challenging yourself to achieve a goal, that is also a leadership skill. My advice is to go to the interview fully prepared to share your strengths.


Do you have any other tips you’d like to share with aspiring Googlers?

Be confident and embrace your uniqueness. Also, don't be afraid to share any accommodation needs during the application process. Bring all of yourself to the interview and tell us how amazing you are! 

Chrome can now caption audio and video

Captions make online content more accessible. If you’re in a noisy environment, trying to keep the volume down, or are part of the 466 million people in the world who are deaf or hard of hearing, having captions lets you follow along to whatever content you are watching — whether it’s viral feta pasta videos, breaking news or a scientist discussing their latest research. 


Unfortunately, captions aren’t always available for every piece of content. Now with Live Caption on Chrome, you can automatically generate real-time captions for media with audio on your browser. It works across social and video sites, podcasts and radio content, personal video libraries (such as Google Photos), embedded video players, and most web-based video or audio chat services.

Screen recording showing the steps to turn on Live Caption feature in Chrome followed by demonstration of the feature in use to add captions to a video of a dog

Turn on Live Caption in Chrome to see captions for media with audio played in your browser window

Laura D’Aquila, a software engineer on Google Workspace who is hard of hearing, tested out the feature early on. “With Live Caption, I no longer have to miss out on watching videos because of lack of captions, and I can engage in real-life conversations with family, friends or colleagues about this content. Just recently, my coworker sent a video to our team's chat, but it was not captioned. With Live Caption I was able to follow along and share my reactions to the video with my team.” 


These captions in Chrome are created on-device, which allows the captions to appear as the content plays without ever having to leave your computer. Live Caption also works offline, so you can even caption audio and video files saved on your hard drive when you play them in Chrome.  


To turn on Live Caption in Chrome from your desktop, go to Chrome Settings, click on the Advanced section, then go to the Accessibility section. The feature currently supports English and is available globally on the latest release of Chrome on Windows, Mac and Linux devices, and will be coming soon to Chrome OS. For Android devices, Live Caption is already available for any audio or video on your mobile device.

Your Android is now even safer — and 5 other new features

It wasn't all that long ago that we introduced Android users to features like Emoji Kitchen and auto-narrated audiobooks. But we like to stay busy, so today we're highlighting six of the latest Google updates that will make Android phones more secure and convenient — for everyone.

1. Keep your accounts safe with Password Checkup on Android

Password Checkup notification screen

On Android, you can save passwords to your Google account, making it quicker and easier to sign into your apps and services using Autofill. Your login credentials are one of your first lines of defense against intruders, so we’ve integrated Password Checkup into devices running Android 9 and above. This feature lets you know if the password you used has been previously exposed and what to do about it.


Now when you enter a password into an app on your phone using Autofill with Google, we’ll check those credentials against a list of known compromised passwords — that is, passwords that have potentially already been stolen and posted on the web. If your credentials show up on one of these lists, we’ll alert you and guide you to check your password and change it. 
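Conceptually, this boils down to testing a hashed credential against a set of known-compromised entries without ever handling the plaintext in the clear. The Python sketch below illustrates only that basic idea; it is not Google's protocol, which uses a more elaborate privacy-preserving scheme so that neither your password nor its full hash is revealed, and the tiny breach list here is hypothetical.

```python
import hashlib

# Hypothetical local set of SHA-1 hashes of passwords known to be compromised.
COMPROMISED_SHA1 = {
    "5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8",  # "password"
    "7C4A8D09CA3762AF61E59520943DC26494F8941B",  # "123456"
}

def is_compromised(password: str) -> bool:
    """Return True if the password's SHA-1 hash appears in the breach set.

    Real systems avoid sending the full hash anywhere; techniques such as
    k-anonymous prefix queries or private set intersection let the check
    happen without revealing the credential to the server.
    """
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest in COMPROMISED_SHA1

if __name__ == "__main__":
    print(is_compromised("password"))       # True: appears in the breach set
    print(is_compromised("correct horse"))  # False
```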


Learn more on our support page about changing unsafe passwords. And you can find additional information about how this product works in this blog post.


We’re passionate about building defense into every detail on Android, from downloading apps to browsing the web to choosing where and when you share your data. Learn more about how Android keeps you safe.

2. Use schedule send in Messages to write a text now and send it later

Schedule a text to send at your chosen date and time

Over half a billion people across the world use Messages to seamlessly and safely connect with family, friends and others every month. To continue improving the way you communicate and help you stay in touch, we’re starting to roll out schedule send in Messages for phones running Android 7 and newer.


Having loved ones in another time zone or on a different schedule can sometimes make it difficult to send a text at an appropriate time. With schedule send, you can compose a message ahead of time when it’s convenient for you, and schedule it to send at the right moment. Just write your message as you normally would, then hold and press the send button to select a date and time to deliver your message. Download Messages or update to the latest version to schedule your next text.

3. No need to look at your screen, with TalkBack

Start and stop media with TalkBack gestures

For those who are blind or have trouble seeing the display, the new version of TalkBack, Android’s screen reader, is now available. Using spoken feedback and gestures, TalkBack makes Android even more accessible and opens up a full phone experience without needing to look at your screen. We worked closely with the blind and low vision communities on this revamp of TalkBack to incorporate the most popularly requested features, including more intuitive gestures, a unified menu, a new reading control menu and more. Get TalkBack today by downloading or updating your Android accessibility apps in the Google Play Store.

4. Get more done hands-free with Google Assistant

Use Google Assistant to send a text, even when your phone is locked

We want to give you more ways to use your phone hands-free — so you can do things like use your voice to make calls, set timers or alarms and play music. Now, the latest updates to Google Assistant make it easier to get things done on your phone without needing to be right next to it.


Assistant now works better even when your phone is locked or across the room, with new cards that can be read with just a glance. Just say “Hey Google, set an alarm” or “Hey Google, play pop music on Spotify.” To get the most out of Assistant when your phone is locked, simply turn on Lock Screen Personal Results in Assistant settings and say “Hey Google” to send text messages and make calls.

5. Come to the dark side with dark theme in Google Maps 

San Francisco on Google Maps dark theme

These days, we’re all experiencing a bit of screen fatigue. With dark theme in Google Maps soon expanding to all Android users globally, you can give your eyes a much-needed break and save on battery life. Simply head to your Settings, tap on Theme and then on “Always in Dark Theme” to lower the lights when you’re navigating, exploring, or getting things done with Maps. Change your mind? Just tap on “Always in Light Theme” to switch it back.

6. A better drive with Android Auto

Stay entertained with voice-activated games on your display with Android Auto

Android Auto’s new features help you enjoy the drive more. With custom wallpapers, you can now select from a variety of car-inspired backgrounds to personalize your car display. For longer drives, you and your passengers can stay entertained with voice-activated games like trivia and “Jeopardy!” Just say, “Hey Google, play a game” to get started. 


We’ve also launched shortcuts on the launch screen. These provide convenient access to your contacts and even allow you to use Assistant to complete tasks like checking the weather or remotely adjusting the thermostat by simply tapping on the icon on your car display, just as you would on your phone. For cars with wider screens, you can do more with a split-screen that features a real-time view of Google Maps and media controls. And if you have family and friends coming along for the ride, you can now set a privacy screen to control when Android Auto appears on your car display. 


These Android Auto features will be available in the coming days on phones running Android 6.0 or above, and when connected to your compatible car.

Source: Android


Our all-new TalkBack screen reader

To blind traveling bluesman Joshua Pearson, songwriting is more than just a good melody. “Songwriting gave me a language to talk about my frustrations. And by putting my music out there, I could hopefully let somebody else feel some of what I was feeling.” For Joshua, TalkBack is his main pen and paper for writing songs; it lets him dictate lyrics into his phone and hear them told back to him.


Screen readers, such as Android’s TalkBack, are the primary interface through which Joshua and many other people who are blind or low vision read, write, send emails, share social media, order delivery and even write music. TalkBack speaks the screen aloud, navigates through apps, and facilitates communication with braille, voice and keyboard input. And today we’re releasing an all-new version of TalkBack that includes some of the most highly requested features from the blind and low vision community.

Tap as you please with multi-finger gestures

We’ve added a dozen easy-to-learn and easy-to-use multi-finger gestures that are available with the latest version of TalkBack on Pixel and ​Samsung Galaxy devices from One UI 3 onwards. These gestures make it easier for you to interact with apps and let you perform common actions, such as selecting and editing text, controlling media and finding help. 

We worked closely with people in the blind and low vision community to develop these easy-to-remember gestures and make sure they felt natural. For example, instead of navigating through multiple menus and announcements to start or stop your favorite podcast, it's now as simple as double tapping the screen with two fingers. 

Read or skim with just a swipe

Reading and listening is easier with new controls that help you find the most relevant information. For instance, you can swipe right or left with three fingers to hear only the headlines, listen word-by-word or even character-by-character. And then with a single swipe up or down you can navigate through the text. 

Say what? There’s new Voice Commands 

Starting with TalkBack 9.1, you can now swipe up and right to use TalkBack’s new voice commands. TalkBack will stop talking and await your instructions. With over 25 different commands, you can say “find” to locate text on the screen or “increase speech rate” to make TalkBack speak more quickly. 

Do things your way with more customization and language options

While we put a lot of thought into this redesign, one thing we’ve learned from working with the community is that everyone interacts with their phones in their own way — which makes customization important. You can now add or remove options in the TalkBack menu or reading controls. Additionally, gestures can be assigned or reassigned to scores of settings, actions and navigation controls.

Lastly, we’re adding support for two new languages in TalkBack’s braille keyboard: Arabic and Spanish.

Joining forces for accessibility

The all-new TalkBack is the result of our collaboration with trusted testers and Samsung, who co-developed this release. ​TalkBack is now the default screen reader on all ​Samsung Galaxy devices from One UI 3 onwards, making it easier to enjoy a consistent and productive screen reader experience across even more devices.

To help everyone keep up with all the changes, we’ve created an entirely new tutorial to make it easier to make the most of TalkBack — there’s even a test pad to practice new gestures. With these new features and collaborations we hope that more people can find useful and creative ways to use TalkBack. Who knows, you might even find lyrical inspiration like Joshua. 

Source: Android


Try it on: Connected clothing that helps everyone

Jacquard by Google aims to simplify your digital life by turning everyday things, like sneakers and jackets, into intuitive interfaces. A connected jacket with woven Jacquard technology lets people connect to their smartphone and use simple gestures to trigger functions from the Jacquard app. With this interactivity and connectivity built in, you can tap your sleeve to hear directions to your next destination or brush your cuff to change the song playing on your compatible music service. Jacquard technology works for phones running Android 6.0.1 or newer and iOS 11 or newer.

As a team, we’re motivated to understand how connected garment technology can provide access to digital services in situations where traditional mobile devices are difficult or inconvenient to use. As part of that goal, we started a series of research projects to explore and discover how Jacquard technology can help people with disabilities live more independent lives. 

We worked with Champions Place, a shared living residence for young adults with disabilities in the greater Atlanta area. Residents at Champions Place tried out the Jacquard Levi’s® Commuter Trucker Jacket and let us know how a connected garment could be even more helpful to each of them.

We discovered that for the residents at Champions Place a connected jacket gave them a simple and unobtrusive way to access technology on the go. For example, many residents at Champions Place commonly rely on emergency call solutions—usually a device worn around the neck that lets them quickly call for assistance. Those who use these devices imagined how the connected garment could be used as a discreet and less obtrusive alternative while blending into their daily lives. 

Once technology becomes part of the things you wear every day, fashion choices become as important as function. One resident trying out the Jacquard connected jackets admitted, “I am not necessarily a jean jacket person. I am thinking it will be useful that I can have a band that can be slipped on, underneath different sleeves or jackets, that way it is not tied to one piece clothing.” It’s feedback like this that helps us explore design solutions that people want to wear. We learned that fashion style and form factors, like a smart jacket or connected patch, matter, and that one solution doesn’t fit all.

Our work with Champions Place has just started. So far, the feedback has helped us envision how technology like Jacquard can help people live more independent lives without sacrificing style. Enhancing everyday objects with digital functionalities can lead to products that are helpful, comfortable, easy-to-use and stylish for everyone — including people with disabilities.

Learn more about Jacquard by Google.   

Chromebooks get an education refresh

Chromebooks — which last year were largely used as classroom tools for writing reports and working on projects — are now the main way many students go to school. As distance learning takes place around the world, educators and students have had to quickly adapt to teaching and learning through Chromebooks. And along the way we’ve updated features and tools to make learning from anywhere easier. 

This year, we have 40 new devices and accessibility improvements coming so that every student can learn the way they want to. 

Tools to help educators teach from anywhere 

Teachers have long recorded lessons to help students do homework and study for tests, but in the past year it’s become downright critical for virtual learning. That’s why we’ve built a screen recording tool right into Chrome OS, arriving in the latest Chromebook update in March. With this tool, teachers and students can record lessons and reports in the classroom and at home.

Screen recorder for Chrome OS

Easier ways for leaders to manage technology

Chrome Education Upgrade unlocks access to Google Admin Console, making it possible for schools to centrally manage massive fleets of Chromebooks. Now, there are over 500 Chrome policies in Google Admin Console, including new ones like Zero Touch Enrollment, which make it easier to deploy and manage Chromebooks at scale — even remotely.

As schools buy hundreds or even thousands of Chromebooks for teachers and students, it’s overwhelming to find the best device to purchase. To make it easier we’ve created a resource to help you find the right Chromebook for whatever you’re looking for — whether it’s in-class learning, virtual learning or devices for faculty and staff.

Updates that equip every student, everywhere  

We’re launching over 40 new Chromebooks. Many of them are convertible Chromebooks that function as both a laptop and a tablet, and come with a stylus, touchscreen and dual cameras so students can take notes, edit videos, create podcasts, draw, publish digital books and record screencasts. Every new Chromebook is equipped to deliver exceptional Google Meet and Zoom experiences — right out of the box. We also have devices that can better support students with limited access to the internet, or in countries with strong mobile broadband networks. These devices, called Always Connected devices, have an LTE connectivity option that allows you to connect via your preferred cellular network.

Making education products that work for all students also means creating accessibility features. And it turns out these features are helpful to everyone — not just people with disabilities. ChromeVox, our full-featured screen reader, has new features including improved tutorials, the ability to search ChromeVox menus, and smooth voice switching that automatically changes the screen reader’s voice based on the language of the text.

We are also making significant audio, video and reliability improvements to Meet on Chromebooks so it continues to work smoothly for everyone. 

Gif of switch access on Chromebooks

How we’re setting the bar higher

As many students are learning from home, it has become even more important for parents and guardians to help support their child’s learning, while also making sure they’re safe online. We’re making it possible for families to add a Google Workspace for Education account to their child’s personal Google Account managed with Family Link. This lets children still log into the apps and websites they need with a school account, while making sure parents can still set guidelines for device and app usage. 

We’ll continue to listen and evolve Google for Education products so they benefit educators, leaders and students. To learn more about all of the upcoming improvements to Chromebooks and Chrome OS, subscribe to our Chrome Enterprise Release notes.

Improving Mobile App Accessibility with Icon Detection

Voice Access enables users to control their Android device hands free, using only verbal commands. In order to function properly, it needs on-screen user interface (UI) elements to have reliable accessibility labels, which are provided to the operating system’s accessibility services via the accessibility tree. Unfortunately, in many apps, adequate labels aren’t always available for UI elements, e.g. images and icons, reducing the usability of Voice Access.

The Voice Access app extracts elements from the view hierarchy to localize and annotate various UI elements. It can provide a precise description for elements that have an explicit content description. In the absence of content descriptions, however, many elements go unrecognized, undermining the ability of Voice Access to work with some apps.

Addressing this challenge requires a system that can automatically detect icons using only the pixel values displayed on the screen, regardless of whether icons have been given suitable accessibility labels. What little research exists on this topic typically uses classifiers, sometimes combined with language models to infer classes and attributes from UI elements. However, these classifiers still rely on the accessibility tree to obtain bounding boxes for UI elements, and fail when appropriate labels do not exist.

Here, we describe IconNet, a vision-based object detection model that can automatically detect icons on the screen in a manner that is agnostic to the underlying structure of the app being used, launched as part of the latest version of Voice Access. IconNet can detect 31 different icon types (to be extended to more than 70 types soon) based on UI screenshots alone. IconNet is optimized to run on-device for mobile environments, with a compact size and fast inference time to enable a seamless user experience. The current IconNet model achieves a mean average precision (mAP) of 94.2% running at 9 FPS on a Pixel 3A.

Voice Access 5.0: the icons detected by IconNet can now be referred to by their names.

Detecting Icons in Screenshots
From a technical perspective, the problem of detecting icons on app screens is similar to classical object detection, in that individual elements are labelled by the model with their locations and sizes. But, in other ways, it’s quite different. Icons are typically small objects, with relatively basic geometric shapes and a limited range of colors, and app screens widely differ from natural images in that they are more structured and geometrical.

A significant challenge in the development of an on-device UI element detector for Voice Access is that it must be able to run on a wide variety of phones with a range of performance capabilities, while preserving the user’s privacy. For a fast user experience, a lightweight model with low inference latency is needed. Because Voice Access needs to use the labels in response to an utterance from a user (e.g., “tap camera” or “show labels”), inference time needs to be short (<150 ms on a Pixel 3A) with a model size of less than 10 MB.

IconNet
IconNet is based on the novel CenterNet architecture, which extracts features from input images and then predicts appropriate bounding box centers and sizes (in the form of heatmaps). CenterNet is particularly suited here because UI elements consist of simple, symmetric geometric shapes, making it easier to identify their centers than for natural images. The total loss used is a combination of a standard L1 loss for the icon sizes and a modified CornerNet focal loss for the center predictions, the latter of which addresses icon class imbalances between commonly occurring icons (e.g., arrow backward, menu, more, and star) and underrepresented icons (e.g., end call, delete, launch apps).
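For reference, the standard CenterNet/CornerNet form of this combined loss is sketched below in LaTeX; the exact modification IconNet applies to the focal term is not spelled out in the post, so treat this as the generic published formulation rather than IconNet's precise loss.

```latex
% Penalty-reduced focal loss over the predicted center heatmaps \hat{Y}
L_{\mathrm{center}} = -\frac{1}{N} \sum_{x,y,c}
\begin{cases}
\bigl(1-\hat{Y}_{xyc}\bigr)^{\alpha} \log \hat{Y}_{xyc} & \text{if } Y_{xyc} = 1 \\
\bigl(1-Y_{xyc}\bigr)^{\beta} \, \hat{Y}_{xyc}^{\alpha} \, \log\bigl(1-\hat{Y}_{xyc}\bigr) & \text{otherwise}
\end{cases}

% L1 loss on predicted icon sizes \hat{s}_k, and the combined objective
L_{\mathrm{size}} = \frac{1}{N} \sum_{k=1}^{N} \bigl|\hat{s}_k - s_k\bigr|,
\qquad
L_{\mathrm{total}} = L_{\mathrm{center}} + \lambda_{\mathrm{size}} \, L_{\mathrm{size}}
```

Here \(Y\) is the Gaussian-smoothed ground-truth heatmap, \(N\) the number of icons in the image, and \(\alpha = 2\), \(\beta = 4\) are the usual CornerNet settings.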

After experimenting with several backbones (MobileNet, ResNet, UNet, etc.), we selected the most promising server-side architecture — Hourglass — as a starting point for designing a backbone tailored for icon and UI element detection. While this architecture is perfectly suitable for server-side models, vanilla Hourglass backbones are not an option for a model that will run on a mobile device, due to their large size and slow inference time. We restricted our on-device network design to a single stack and drastically reduced the width of the backbone. Furthermore, as the detection of icons relies on more local features (compared to real objects), we could further reduce the depth of the backbone without adversely affecting the performance. Ablation studies convinced us of the importance of skip connections and high-resolution features. For example, trimming skip connections in the final layer reduced the mAP by 1.5%, and removing such connections from both the final and penultimate layers resulted in a decline of 3.5% in mAP.

IconNet analyzes the pixels of the screen and identifies the centers of icons by generating heatmaps, which provide precise information about the position and type of the different icons present on the screen. This enables Voice Access users to refer to these elements by their name (e.g., “Tap ‘menu’”).

Model Improvements
Once the backbone architecture was selected, we used neural architecture search (NAS) to explore variations on the network architecture and uncover an optimal set of training and model parameters that would balance model performance (mAP) with latency (FLOPs). Additionally, we used Fine-Grained Stochastic Architecture Search (FiGS) to further refine the backbone design. FiGS is a differentiable architecture search technique that uncovers sparse structures by pruning a candidate architecture and discarding unnecessary connections. This technique allowed us to reduce the model size by 20% without any loss in performance, and by 50% with only a minor drop of 0.3% in mAP.

Improving the quality of the training dataset also played an important role in boosting the model performance. We collected and labeled more than 700K screenshots, and in the process, we streamlined data collection by using heuristics and auxiliary models to identify rarer icons. We also took advantage of data augmentation techniques by enriching existing screenshots with infrequent icons.
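As a rough illustration of that kind of augmentation, the sketch below pastes a rare-icon crop onto an existing screenshot with Pillow; the file paths, icon sizes and random placement policy are hypothetical and not the team's actual pipeline.

```python
import random
from PIL import Image

def paste_rare_icon(screenshot_path, icon_path, out_path):
    """Paste a rare-icon crop onto a screenshot at a random location and
    return the bounding box to add to the synthetic image's labels."""
    screen = Image.open(screenshot_path).convert("RGB")
    icon = Image.open(icon_path).convert("RGBA")

    # Keep icons roughly at UI scale (e.g., 24-48 px on a phone screenshot).
    size = random.randint(24, 48)
    icon = icon.resize((size, size))

    x = random.randint(0, screen.width - size)
    y = random.randint(0, screen.height - size)
    screen.paste(icon, (x, y), mask=icon)  # alpha channel used as paste mask
    screen.save(out_path)

    # Bounding box in (xmin, ymin, xmax, ymax) pixel coordinates.
    return (x, y, x + size, y + size)
```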

To improve the inference time, we modified our model to run using the Neural Networks API (NNAPI) on a variety of Qualcomm DSPs available on many mobile phones. For this, we converted the model to use 8-bit integer quantization, which gives the additional benefit of a smaller model. After some experimentation, we used quantization-aware training to quantize the model while matching the performance of a server-side floating point model. The quantized model results in a 6x speed-up (700 ms vs. 110 ms) and a 50% size reduction, while losing only ~0.5% mAP compared to the unquantized model.
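As a rough sketch of that workflow using public tooling (the post does not name IconNet's exact toolchain, and the tiny placeholder model below is purely illustrative), quantization-aware training followed by TFLite conversion might look like this:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder stand-in for the trained float detector (IconNet itself is not public):
# a tiny conv net emitting per-pixel "heatmap + size" channels.
float_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(192, 192, 3)),
    tf.keras.layers.Conv2D(8, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(31 + 2, 1),  # 31 icon-class heatmaps + 2 size channels
])

# Quantization-aware training: wrap the model with fake-quantization ops and
# fine-tune it, so weights and activations adapt to 8-bit precision.
qat_model = tfmot.quantization.keras.quantize_model(float_model)
# ... compile qat_model and fine-tune it on the detection loss here ...

# Convert the QAT model to an 8-bit TFLite model; this is what shrinks the model
# and lets NNAPI offload inference to mobile accelerators such as DSPs.
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# (If the target delegate needs integer inputs/outputs, inference_input_type and
#  inference_output_type can additionally be set to tf.int8.)
tflite_model = converter.convert()

with open("iconnet_int8_sketch.tflite", "wb") as f:
    f.write(tflite_model)
```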

Results
We use traditional object detection metrics (e.g., mAP) to measure model performance. In addition, to better capture the use case of voice-controlled user actions, we define a modified version of a false positive (FP) detection, where we penalize incorrect detections more heavily for icon classes that are present on the screen. For comparing detections with ground truth, we use the center in region of interest (CIROI), another metric we developed for this work, which returns a positive match when the center of the detected bounding box lies inside the ground truth bounding box. This better captures the Voice Access mode of operation, where actions are performed by tapping anywhere in the region of the UI element of interest.
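A minimal sketch of the CIROI matching rule described above, with boxes given as (xmin, ymin, xmax, ymax) tuples; the real evaluation additionally tracks classes, confidence scores and the modified FP counting.

```python
def box_center(box):
    """Center (x, y) of a box given as (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    return (xmin + xmax) / 2.0, (ymin + ymax) / 2.0

def ciroi_match(detected_box, ground_truth_box):
    """Center-in-region-of-interest: a detection counts as a positive match
    when the center of the detected box lies inside the ground-truth box."""
    cx, cy = box_center(detected_box)
    xmin, ymin, xmax, ymax = ground_truth_box
    return xmin <= cx <= xmax and ymin <= cy <= ymax

# Example: the detection is offset, but its center still falls inside the ground
# truth box, mirroring how Voice Access only needs a tap anywhere inside the element.
print(ciroi_match((12, 10, 40, 38), (0, 0, 48, 48)))  # True
```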

We compared the IconNet model with various other mobile-compatible object detectors, including MobileNetEdgeTPU and SSD MobileNet v2. Experiments showed that for a fixed latency, IconNet outperformed the other models in terms of mAP@CIROI on our internal evaluation set.

Model    mAP@CIROI
IconNet (Hourglass)    96%
IconNet (HRNet)    89%
MobilenetEdgeTPU (AutoML)    91%
SSD Mobilenet v2    88%

The performance advantage of IconNet persists when considering quantized models and models for a fixed latency budget.

Models (Quantized)    mAP@CIROI    Model size    Latency*
IconNet (Currently deployed)    94.20%    8.5 MB    107 ms
IconNet (XS)    92.80%    2.3 MB    102 ms
IconNet (S)    91.70%    4.4 MB    45 ms
MobilenetEdgeTPU (AutoML)    88.90%    7.8 MB    26 ms
*Measured on Pixel 3A.

Conclusion and Future Work
We are constantly working on improving IconNet. Among other things, we are interested in increasing the range of elements supported by IconNet to include any generic UI element, such as images, text, or buttons. We also plan to extend IconNet to differentiate between similar looking icons by identifying their functionality. On the application side, we are hoping to increase the number of apps with valid content descriptions by augmenting developer tools to suggest content descriptions for different UI elements when building applications.

Acknowledgements
This project is the result of joint work with Maria Wang, Tautvydas Misiūnas, Lijuan Liu, Ying Xu, Nevan Wichers, Xiaoxue Zang, Gabriel Schubiner, Abhinav Rastogi, Jindong (JD) Chen, Abhanshu Sharma, Pranav Khaitan, Matt Sharifi and Blaise Aguera y Arcas. We sincerely thank our collaborators Robert Berry, Folawiyo Campbell, Shraman Ray Chaudhuri, Nghi Doan, Elad Eban, Marybeth Fair, Alec Go, Sahil Goel, Tom Hume, Cassandra Luongo, Yair Movshovitz-Attias, James Stout, Gabriel Taubman and Anton Vayvod. We are very grateful to Tom Small for assisting us in preparing the post.

Source: Google AI Blog

