
Conceptual Captions: A New Dataset and Challenge for Image Captioning



The web is filled with billions of images, helping to entertain and inform the world on a countless variety of subjects. However, much of that visual information is not accessible to people with visual impairments, or to those with slow internet connections that prevent images from loading. Image captions, manually added by website authors using the Alt-text HTML attribute, are one way to make this content more accessible, so that a natural-language description of each image can be presented using text-to-speech systems. However, existing human-curated Alt-text HTML fields are added for only a very small fraction of web images. And while automatic image captioning can help solve this problem, accurate image captioning is a challenging task that requires advancing the state of the art of both computer vision and natural language processing.
Image captioning can help millions with visual impairments by converting images to text captions. Image by Francis Vallance (Heritage Warrior), used under CC BY 2.0 license.
Today we introduce Conceptual Captions, a new dataset consisting of ~3.3 million image/caption pairs that are created by automatically extracting and filtering image caption annotations from billions of web pages. Introduced in a paper presented at ACL 2018, Conceptual Captions represents an order of magnitude increase in captioned images over the human-curated MS-COCO dataset. As measured by human raters, the machine-curated Conceptual Captions dataset has an accuracy of ~90%. Furthermore, because the images in Conceptual Captions are pulled from across the web, it represents a wider variety of image-caption styles than previous datasets, allowing for better training of image captioning models. To track progress on image captioning, we are also announcing the Conceptual Captions Challenge, in which the machine learning community can train and evaluate their own image captioning models on the Conceptual Captions test bed.
Illustration of images and captions in the Conceptual Captions dataset.
Clockwise from top left, images by Jonny Hunter, SigNote Cloud, Tony Hisgett, ResoluteSupportMedia. All images used under CC BY 2.0 license
Generating the Dataset
To generate the Conceptual Captions dataset, we start by sourcing images from the web that have Alt-text HTML attributes. We automatically screen these for certain properties to ensure image quality while also avoiding undesirable content such as adult themes. We then apply text-based filtering, removing captions with non-descriptive text (such as hashtags, poor grammar or added language that does not relate to the image); we also discard texts with high sentiment polarity or adult content (for more details on the filtering criteria, please see our paper). We use existing image classification models to make sure that, for any given image, there is overlap between its Alt-text (allowing for word variations) and the labels that the image classifier outputs for that image.
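To make the filtering concrete, here is a minimal sketch of what such a candidate filter might look like. The thresholds, word list, and classifier hook are illustrative assumptions, not the actual criteria used to build the dataset (those are described in the paper).

```python
# A minimal, illustrative sketch of the filtering logic described above.
# The thresholds, word lists, and the classifier hook are assumptions for
# illustration only; the actual criteria are described in the ACL 2018 paper.
import re

BANNED_WORDS = {"xxx", "nsfw"}           # placeholder list for adult-content screening

def passes_text_filters(alt_text: str) -> bool:
    """Drop captions that look non-descriptive (hashtags, too short, unsafe)."""
    if "#" in alt_text:                  # hashtags suggest social-media noise
        return False
    words = re.findall(r"[a-zA-Z']+", alt_text.lower())
    if len(words) < 3:                   # assumed minimum caption length
        return False
    if BANNED_WORDS & set(words):
        return False
    return True

def image_text_overlap(alt_text: str, classifier_labels: set) -> bool:
    """Require at least one image-classifier label to appear in the Alt-text
    (allowing for simple word variations via crude prefix matching)."""
    tokens = set(re.findall(r"[a-z']+", alt_text.lower()))
    return any(
        any(tok.startswith(label[:4]) for tok in tokens)
        for label in classifier_labels
    )

def keep_candidate(alt_text: str, classifier_labels: set) -> bool:
    """A candidate survives only if both the text and overlap checks pass."""
    return passes_text_filters(alt_text) and image_text_overlap(alt_text, classifier_labels)

# Example: the first caption survives because "dog" overlaps with the classifier output.
print(keep_candidate("a small dog playing in the park", {"dog", "grass"}))  # True
print(keep_candidate("#tbt", {"person"}))                                   # False
```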

From Specific Names to General Concepts
While candidates passing the above filters tend to be good Alt-text image descriptions, a large majority use proper names (for people, venues, locations, organizations, etc.). This is problematic because it is very difficult for an image captioning model to learn such fine-grained proper-name inference from input image pixels while simultaneously generating natural-language descriptions1.

To address the above problems, we wrote software that automatically replaces proper names with words representing the same general notion, i.e., with their concept. In some cases, the proper names are removed to simplify the text. For example, we substitute people’s names (e.g., “Former Miss World Priyanka Chopra on the red carpet” becomes “actor on the red carpet”), remove location names (“Crowd at a concert in Los Angeles” becomes “Crowd at a concert”), remove named modifiers (e.g., “Italian cuisine” becomes just “cuisine”) and correct newly formed noun phrases if needed (e.g., “artist and artist” becomes “artists”; see the example illustration below).
Illustration of text modification. Image by Rockoleando used under CC BY 2.0 license.
Finally, we cluster all resolved entities (e.g., “artist”, “dog”, “neighborhood”, etc.) and keep only the candidate types which have a count of over 100 mentions, a quantity sufficient to support representation learning for these entities. This retained around 16K entity concepts such as: “person”, “actor”, “artist”, “player” and “illustration”. Less frequent ones that we retained include “baguette”, “bridle”, “deadline”, “ministry” and “funnel”.
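As a rough illustration of this hypernymization and frequency-filtering step, the sketch below uses spaCy's generic named-entity recognizer to replace or drop entity mentions and then keeps only concepts with more than 100 mentions. This is a toy approximation: the label-to-concept mapping is an assumption, and the actual pipeline resolves much finer-grained entity types (e.g., mapping "Priyanka Chopra" to "actor").

```python
# A toy approximation of the hypernymization step, using spaCy's generic
# named-entity recognizer instead of the fine-grained entity resolution used
# for the dataset. The label map and the 100-mention cutoff mirror the
# description above, but the exact mapping is an assumption for illustration.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

LABEL_TO_CONCEPT = {      # coarse NER label -> general concept (assumed mapping)
    "PERSON": "person",
    "ORG": "organization",
    "GPE": None,          # location names are simply dropped
    "LOC": None,
    "NORP": None,         # named modifiers like "Italian" are dropped
}

def hypernymize(caption):
    """Replace proper names with general concepts, or drop them entirely."""
    doc = nlp(caption)
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ not in LABEL_TO_CONCEPT:
            continue
        out.append(caption[last:ent.start_char])
        concept = LABEL_TO_CONCEPT[ent.label_]
        if concept:
            out.append(concept)
        last = ent.end_char
    out.append(caption[last:])
    return " ".join("".join(out).split())

def build_concept_vocab(captions, min_count=100):
    """Keep only concepts frequent enough to support representation learning."""
    counts = Counter(
        LABEL_TO_CONCEPT[ent.label_]
        for cap in captions
        for ent in nlp(cap).ents
        if LABEL_TO_CONCEPT.get(ent.label_)
    )
    return {concept for concept, n in counts.items() if n > min_count}

# e.g. -> "Crowd at a concert in" (modulo NER model behaviour; the real
# pipeline also cleans up dangling words such as the trailing "in").
print(hypernymize("Crowd at a concert in Los Angeles"))
```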

In the end, it required roughly one billion (English) webpages containing over 5 billion candidate images to obtain a clean and learnable image caption dataset of over 3M samples (a rejection rate of 99.94%). Our control parameters were biased towards high precision, although these can be tuned to generate an order of magnitude more examples with lower precision.

Dataset Impact
To test the usefulness of our dataset, we independently trained both RNN-based and Transformer-based image captioning models implemented in Tensor2Tensor (T2T), using the MS-COCO dataset (120K images with 5 human-annotated captions per image) and the new Conceptual Captions dataset (over 3.3M images with 1 caption per image). See our paper for more details on the model architectures.
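For a concrete picture of how the two training corpora differ in shape, the hypothetical snippet below flattens MS-COCO-style records (one image, five reference captions) and Conceptual-Captions-style records (one image, one caption) into the same kind of (image, caption) training pairs. The field names are illustrative assumptions, not the actual Tensor2Tensor problem definitions.

```python
# Illustrative only: the record layouts and field names below are assumptions,
# not the actual Tensor2Tensor problem definitions. The point is simply that
# both corpora reduce to (image, caption) training pairs: MS-COCO contributes
# ~120K images x 5 captions each, Conceptual Captions ~3.3M images x 1 caption.
def coco_pairs(records):
    """MS-COCO style: each record carries one image and five reference captions."""
    for rec in records:
        for caption in rec["captions"]:     # 5 human-annotated captions per image
            yield rec["image_path"], caption

def conceptual_pairs(records):
    """Conceptual Captions style: one machine-curated caption per image."""
    for rec in records:
        yield rec["image_path"], rec["caption"]

coco = [{"image_path": "img_001.jpg",
         "captions": ["a dog on a couch", "a puppy resting", "dog lying down",
                      "a small dog indoors", "dog on furniture"]}]
cc = [{"image_path": "img_002.jpg", "caption": "dog resting on the sofa"}]

print(len(list(coco_pairs(coco))))          # 5 training pairs from one image
print(len(list(conceptual_pairs(cc))))      # 1 training pair per image
```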

These models were tested using images from the Flickr30K dataset (which are out-of-domain for both MS-COCO and Conceptual Captions), and the resulting captions were evaluated by 3 human raters per test case. The results are reported in the table below.
From these results we conclude that models trained on Conceptual Captions generalize better than competing approaches irrespective of the architecture (i.e., RNN or Transformer). In addition, we found that Transformer models did better than RNN models when trained on either dataset. The conclusion from these findings is that Conceptual Captions makes it possible to train image captioning models that perform better on a wide variety of images.

Get Involved
It is our hope that this dataset will help the machine learning community advance the state of the art in image captioning models. Importantly, since no human annotators were involved in its creation, this dataset is highly scalable, potentially allowing the expansion of the dataset to enable automatic creation of Alt-text-HTML-like descriptions for an even wider variety of images. We encourage all those interested to partake in the Conceptual Captions Challenge, and we look forward to seeing what the community can do! For more details and the latest results please visit the challenge website.

Acknowledgements
Thanks to Nan Ding, Sebastian Goodman and Bo Pang for training models with the Conceptual Captions dataset, and to Amol Wankhede for driving the public release efforts for the dataset.


1 In our paper, we posit that if automatic determination of names, locations, brands, etc. from the image is needed, it should be done as a separate task that may leverage image meta-information (e.g. GPS info), or complementary techniques such as OCR.

Source: Google AI Blog


Making Morse code available to more people on Gboard

Earlier this year, we partnered with developer Tania Finlayson, an expert in Morse code assistive technology, to make Morse code more accessible. Today, we’re rolling out Morse code on Gboard for iOS and improvements to Morse code on Gboard for Android. To help you learn how to type in Morse code, we’ve created a game (on Android, iOS, and desktop) that can teach you in less than an hour! We’ve worked closely with Tania on these updates to the keyboard and more. Here, she explains how Morse code changed her life:

My name is Tania Finlayson, and I was born with cerebral palsy. A few doctors told my parents that I probably would not amount to anything, and suggested my parents put me in an institution. Luckily, my parents did not take the advice, raised me like a normal child, and did not expect any less of me throughout my childhood. I had to eat my dinner first before I could have desserts, I had to go to bed at bedtime, and I got in trouble when I picked on my older brother.

The only difference was that I was not able to communicate very effectively; basically, I could only answer “yes” and “no” questions. When I was old enough to read, I used a communication word board with about 200 words on it. I used a head stick to point to the words. A couple of years later, my dad decided that I should try a typewriter and press the keys with the head stick. Amazingly, my vocabulary grew. My mom did not dress me in plaid any more, I could tell on my brother, and I finally had the chance to annoy my Dad with question after question about the world. I am quite sure that my Dad did not, in any way, regret letting me try a typewriter. Ha!

Several years later, I was one of four kids chosen to participate in a study for non-verbal children at the University of Washington. The study was led by Al Ross, who wrote a grant funding the creation of a Morse code communicator for disabled children. Morse code, which is a communication system that dates back to the 1800s, allowed us to spell out words and communicate just by using two buttons: a dot “.” and a dash “—”.

The device was revolutionary.  It would convert my Morse code into letters then speak out loud in English and had a small printer installed in it.  I could activate a light to “raise my hand in class.” At first I thought learning Morse code would be a waste of time, but soon learned that it gave me total freedom with my words, and for the first time, I could talk with ease, without breaking my neck. School became fun, instead of exhausting. I could focus on my studies, and have real conversations with my friends for the first time. Also, I did not need an adult figure with me every moment at school, and that was awesome.

My experience with the Morse code communicator led me to a partnership with Google on bringing Morse code to Gboard. Working closely with the team, I helped design the keyboard layout, added Morse sequences to the auto-suggestion strip above the keyboard, and developed settings that allow people to customize the keyboard to their unique needs. The Morse code keyboard on Gboard allows people to use Morse code (dots and dashes) to enter text, instead of the regular (QWERTY) keyboard. Gboard for Android lets you hook external switches to the device (check out the source code my husband Ken and I developed), so a person with limited mobility could operate the device.
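As a rough illustration of the kind of input flow a two-switch Morse keyboard enables (not Gboard's actual implementation), the sketch below decodes a stream of dot and dash presses into letters, committing each pending letter on a pause.

```python
# A rough illustration of two-switch Morse entry (not Gboard's actual code):
# one switch emits a dot, the other a dash, and a pause commits the letter.
MORSE_TO_CHAR = {
    ".-": "A", "-...": "B", "-.-.": "C", "-..": "D", ".": "E", "..-.": "F",
    "--.": "G", "....": "H", "..": "I", ".---": "J", "-.-": "K", ".-..": "L",
    "--": "M", "-.": "N", "---": "O", ".--.": "P", "--.-": "Q", ".-.": "R",
    "...": "S", "-": "T", "..-": "U", "...-": "V", ".--": "W", "-..-": "X",
    "-.--": "Y", "--..": "Z",
}

def decode(presses):
    """Decode a stream of switch events: '.' and '-' build up a letter,
    ' ' (a pause) commits it, and '/' marks a word boundary."""
    text, current = [], ""
    for event in presses:
        if event in ".-":
            current += event
        else:                               # pause: commit the pending letter
            if current:
                text.append(MORSE_TO_CHAR.get(current, "?"))
                current = ""
            if event == "/":
                text.append(" ")
    if current:
        text.append(MORSE_TO_CHAR.get(current, "?"))
    return "".join(text)

print(decode("--. -... --- .- .-. -.."))    # -> "GBOARD"
```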


I’m excited to see what people will build that integrates with Morse code. Whether it’s a keyboard like Gboard, a game, or an educational app, the possibilities are endless. Most technology today is designed for the mass market. Unfortunately, this can mean that people with disabilities can be left behind. Developing communication tools like this is important, because for many people, it simply makes life livable. Now, if anyone wants to try Morse code, they can use the phone in their pocket. Just by downloading an app, anyone anywhere can give communicating with Morse code a try.

When I was first able to communicate as a child, the first feeling that I had was “Wow! This is pretty far out!” The first thing I typed was “You’re an old fart, Dad!” That was the first time I saw him laugh with tears in his eyes; I still don’t know if I made him really laugh or if I made him really sad! Probably a little of both.

Start making your business more accessible using Primer

Posted by Lisa Gevelber, VP Marketing Ads and Americas

Over one billion people in the world have some form of disability.

That's why we make accessibility a core consideration when we develop new products—from concept to launch and beyond. It's good for users and good for business: Building products that don't consider a diverse range of needs could mean missing a substantial group of potential users and customers.

But impairments and disabilities are as varied as people themselves. For designers, developers, marketers or small business owners, making your products and designs more accessible might seem like a daunting task. How can you make sure you're being more inclusive? Where do you start?

Today, Global Accessibility Awareness Day, we're launching a new suite of resources to help creators, marketers, and designers answer those questions and build more inclusive products and designs.

The first step is learning about accessibility. Simply start by downloading the Google Primer app and searching for "accessibility." You'll find five-minute lessons that help you better understand accessibility and learn practical tips to start making your own business, products, and designs more accessible, like key design principles for building a more accessible website. You may even discover that addressing accessibility issues can improve the user experience for everyone. For instance, closed captions can make your videos accessible to more people, whether they have a hearing impairment or are sitting in a crowded room.

Next, visit the Google Accessibility page and discover free tools that can help you make your site or app more accessible for more people. The Android Developers site also contains a wide range of suggestions to help you improve the accessibility of your app.

We hope these resources will help you join us in designing and building for a more inclusive future. After all, an accessible web and world is a better one—both for people and for business.

"Excited to see the new lessons on accessibility that Primer launched today. They help us learn how to start making websites and products more accessible. With over 1 billion people in the world with some form of disability, building a more inclusive web is the right thing to do both for people and for business."