Monthly Archives: August 2019

Bringing Live Transcribe’s Speech Engine to Everyone

Earlier this year, Google launched Live Transcribe, an Android application that provides real-time automated captions for people who are deaf or hard of hearing. Through many months of user testing, we've learned that robustly delivering good captions for long-form conversations isn't easy, and we want to make it easier for developers to build upon what we've learned. Live Transcribe's speech recognition is provided by Google's state-of-the-art Cloud Speech API, which under most conditions delivers impressive transcript accuracy. However, relying on the cloud introduces several complications—most notably coping with ever-changing network connections, data costs, and latency. Today, we are sharing our transcription engine with the world so that developers everywhere can build applications with robust transcription.

Those who have worked with our Cloud Speech API know that sending infinitely long streams of audio is currently unsupported. To help solve this challenge, we close and restart streaming requests before hitting the timeout, restarting the session during long periods of silence and closing whenever we detect a pause in the speech; closing at any other time would truncate a word or sentence. In between sessions, we buffer audio locally and send it upon reconnection. This reduces the amount of text lost mid-conversation, whether due to restarting speech requests or switching between wireless networks.
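The details vary by application, but the core idea is simple enough to sketch. The Python snippet below is a minimal, library-agnostic illustration of the approach, assuming hypothetical read_chunk() (microphone audio), is_speech() (a voice-activity detector), and open_stream() (the cloud streaming client) callables; it is a sketch of the idea, not Live Transcribe's actual implementation.

```python
import collections
import time

MAX_SESSION_SECONDS = 240      # restart well before the server-side streaming limit
CHUNK_SECONDS = 0.1            # audio read from the microphone per iteration

pending = collections.deque()  # audio buffered locally while disconnected


def run_captioning(read_chunk, is_speech, open_stream):
    """Feed microphone audio to a streaming recognizer, restarting sessions
    only during pauses (so no words are cut off) and buffering audio locally
    whenever the network drops."""
    while True:
        try:
            stream = open_stream()             # start a new streaming request
            session_start = time.time()
            while pending:                     # flush audio captured while offline
                stream.send(pending.popleft())
            while True:
                chunk = read_chunk(CHUNK_SECONDS)
                stream.send(chunk)
                # Close only during a detected pause, and before the timeout,
                # so we never truncate a word or sentence mid-utterance.
                if (time.time() - session_start > MAX_SESSION_SECONDS
                        and not is_speech(chunk)):
                    stream.close()
                    break
        except ConnectionError:
            # Network dropped: keep capturing audio and resend it on reconnect.
            pending.append(read_chunk(CHUNK_SECONDS))
```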



Endlessly streaming audio comes with its own challenges. In many countries, network data is quite expensive and in spots with poor internet, bandwidth may be limited. After much experimentation with audio codecs (in particular, we evaluated the FLAC, AMR-WB, and Opus codecs), we were able to achieve a 10x reduction in data usage without compromising accuracy. FLAC, a lossless codec, preserves accuracy completely, but doesn't save much data. It also has noticeable codec latency. AMR-WB, on the other hand, saves a lot of data, but delivers much worse accuracy in noisy environments. Opus was a clear winner, allowing data rates many times lower than most music streaming services while still preserving the important details of the audio signal—even in noisy environments. Beyond relying on codecs to keep data usage to a minimum, we also support using speech detection to close the network connection during extended periods of silence. That means if you accidentally leave your phone on and running Live Transcribe when nobody is around, it stops using your data.
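To put that 10x figure in perspective, here is a back-of-the-envelope comparison for one hour of audio, assuming 16 kHz, 16-bit mono capture and an illustrative Opus bitrate of 24 kbps (these settings are assumptions for the arithmetic, not Live Transcribe's published configuration):

```python
# Rough data-rate comparison for one hour of captioned audio.
SAMPLE_RATE_HZ = 16_000      # typical speech-recognition sample rate
BITS_PER_SAMPLE = 16         # uncompressed 16-bit PCM
OPUS_BITRATE_BPS = 24_000    # illustrative Opus setting, not an official figure

pcm_mb_per_hour = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * 3600 / 8 / 1e6
opus_mb_per_hour = OPUS_BITRATE_BPS * 3600 / 8 / 1e6

print(f"Uncompressed PCM: {pcm_mb_per_hour:.0f} MB/hour")   # ~115 MB/hour
print(f"Opus:             {opus_mb_per_hour:.0f} MB/hour")  # ~11 MB/hour
```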

Finally, we know that if you are relying on captions, you want them immediately, so we've worked hard to keep latency to a minimum. Though most of the credit for speed goes to the Cloud Speech API, Live Transcribe's final trick lies in our custom Opus encoder. At the cost of only a minor increase in bitrate, we see latency that is visually indistinguishable from that of sending uncompressed audio.

Today, we are excited to make all of this available to developers everywhere. We hope you'll join us in trying to build a world that is more accessible for everyone.

By Chet Gnegy, Alex Huang, and Ausmus Chang from the Live Transcribe Team

Joint Speech Recognition and Speaker Diarization via Sequence Transduction



Being able to recognize “who said what,” or speaker diarization, is a critical step in understanding audio of human dialog through automated means. For instance, in a medical conversation between doctors and patients, “Yes” uttered by a patient in response to “Have you been taking your heart medications regularly?” has a substantially different implication than a rhetorical “Yes?” from a physician.

Conventional speaker diarization (SD) systems use two stages, the first of which detects changes in the acoustic spectrum to determine when the speakers in a conversation change, and the second of which identifies individual speakers across the conversation. This basic multi-stage approach is almost two decades old, and during that time only the speaker change detection component has improved.

With the recent development of a novel neural network model—the recurrent neural network transducer (RNN-T)—we now have a suitable architecture to improve the performance of speaker diarization, addressing some of the limitations of the diarization system we presented recently. As reported in our recent paper, “Joint Speech Recognition and Speaker Diarization via Sequence Transduction,” to be presented at Interspeech 2019, we have developed an RNN-T based speaker diarization system and demonstrated a breakthrough in performance, reducing the word diarization error rate from about 20% to 2%—a factor of 10 improvement.

Conventional Speaker Diarization Systems
Conventional speaker diarization systems rely on differences in how people sound acoustically to distinguish the speakers in a conversation. While male and female speakers can be identified relatively easily from their pitch using simple acoustic models (e.g., Gaussian mixture models) in a single stage, speaker diarization systems use a multi-stage approach to distinguish between speakers with potentially similar pitch. First, a change detection algorithm breaks up the conversation into homogeneous segments, each hopefully containing only a single speaker, based upon detected vocal characteristics. Then, deep learning models are employed to map segments from each speaker to an embedding vector. Finally, in a clustering stage, these embeddings are grouped together to keep track of the same speaker across the conversation.
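As a rough illustration of that final clustering stage, the sketch below groups per-segment speaker embeddings with off-the-shelf agglomerative clustering. Production systems use much stronger embedding models and clustering algorithms; the toy embeddings here are made up.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_speakers(segment_embeddings, num_speakers):
    """Assign a speaker label to each segment embedding.

    segment_embeddings: array of shape [num_segments, embedding_dim], one
    vector per (hopefully single-speaker) segment. Note the weakness discussed
    below: the number of speakers must be supplied.
    """
    clustering = AgglomerativeClustering(n_clusters=num_speakers)
    return clustering.fit_predict(segment_embeddings)

# Toy example: 4 segments from 2 speakers.
embeddings = np.array([[0.90, 0.10], [0.88, 0.12], [0.10, 0.95], [0.12, 0.90]])
print(cluster_speakers(embeddings, num_speakers=2))  # two clusters; label ids are arbitrary
```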

In practice, the speaker diarization system runs in parallel to the automatic speech recognition (ASR) system and the outputs of the two systems are combined to attribute speaker labels to the recognized words.
Conventional speaker diarization system infers speaker labels in the acoustic domain and then overlays the speaker labels on the words generated by a separate ASR system.
There are several limitations with this approach that have hindered progress in this field. First, the conversation needs to be broken up into segments that only contain speech from one speaker. Otherwise, the embedding will not accurately represent the speaker. In practice, however, the change detection algorithm is imperfect, resulting in segments that may contain multiple speakers. Second, the clustering stage requires that the number of speakers be known and is particularly sensitive to the accuracy of this input. Third, the system needs to make a very difficult trade-off between the segment size over which the voice signatures are estimated and the desired model accuracy. The longer the segment, the better the quality of the voice signature, since the model has more information about the speaker. This comes at the risk of attributing short interjections to the wrong speaker, which could have very high consequences, for example, in the context of processing a clinical or financial conversation where affirmation or negation needs to be tracked accurately. Finally, conventional speaker diarization systems do not have an easy mechanism to take advantage of linguistic cues that are particularly prominent in many natural conversations. An utterance, such as “How often have you been taking the medication?” in a clinical conversation is most likely uttered by a medical provider, not a patient. Likewise, the utterance, “When should we turn in the homework?” is most likely uttered by a student, not a teacher. Linguistic cues also signal high probability of changes in speaker turns, for example, after a question.

There are a few exceptions to the conventional speaker diarization approach, one of which was reported in our recent blog post. In that work, the hidden states of the recurrent neural network (RNN) tracked the speakers, circumventing the weakness of the clustering stage. The work presented here takes a different approach and incorporates linguistic cues as well.

An Integrated Speech Recognition and Speaker Diarization System
We developed a novel and simple model that not only combines acoustic and linguistic cues seamlessly, but also combines speaker diarization and speech recognition into a single system. The integrated model does not significantly degrade speech recognition performance compared to an equivalent recognition-only system.

The key insight in our work was to recognize that the RNN-T architecture is well-suited to integrate acoustic and linguistic cues. The RNN-T model consists of three different networks: (1) a transcription network (or encoder) that maps the acoustic frames to a latent representation, (2) a prediction network that predicts the next target label given the previous target labels, and (3) a joint network that combines the output of the previous two networks and generates a probability distribution over the set of output labels at that time step. Note, there is a feedback loop in the architecture (diagram below) where previously recognized words are fed back as input, and this allows the RNN-T model to incorporate linguistic cues, such as the end of a question.
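A minimal sketch of this three-network structure, written with TensorFlow/Keras, is shown below. The layer sizes and vocabulary are illustrative; the model described in the paper is much larger and differs in detail.

```python
import tensorflow as tf

VOCAB_SIZE = 4096  # wordpieces plus speaker-role tags plus the blank symbol

class TinyRNNT(tf.keras.Model):
    """Minimal RNN-T: transcription (encoder), prediction, and joint networks."""

    def __init__(self):
        super().__init__()
        # (1) Transcription network: acoustic frames -> latent representation.
        self.encoder = tf.keras.layers.LSTM(512, return_sequences=True)
        # (2) Prediction network: previously emitted labels -> label context.
        self.embed = tf.keras.layers.Embedding(VOCAB_SIZE, 128)
        self.predictor = tf.keras.layers.LSTM(512, return_sequences=True)
        # (3) Joint network: combine both and score every output label.
        self.enc_proj = tf.keras.layers.Dense(512)
        self.pred_proj = tf.keras.layers.Dense(512)
        self.out = tf.keras.layers.Dense(VOCAB_SIZE)

    def call(self, acoustic_frames, previous_labels):
        enc = self.encoder(acoustic_frames)                  # [B, T, 512]
        pred = self.predictor(self.embed(previous_labels))   # [B, U, 512]
        # Broadcast-add so every (frame t, label position u) pair is scored.
        joint = tf.nn.tanh(self.enc_proj(enc)[:, :, None, :] +
                           self.pred_proj(pred)[:, None, :, :])  # [B, T, U, 512]
        return self.out(joint)                               # [B, T, U, VOCAB_SIZE]

# Toy forward pass: batch of 2, 50 frames of 80-dim features, 7 previous labels.
model = TinyRNNT()
logits = model(tf.random.normal([2, 50, 80]),
               tf.random.uniform([2, 7], maxval=VOCAB_SIZE, dtype=tf.int32))
print(logits.shape)  # (2, 50, 7, 4096)
```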
An integrated speech recognition and speaker diarization system where the system jointly infers who spoke when and what.
Training the RNN-T model on accelerators like graphics processing units (GPUs) or tensor processing units (TPUs) is non-trivial, as computation of the loss function requires running the forward-backward algorithm, which includes all possible alignments of the input and output sequences. This issue was addressed recently in a TPU-friendly implementation of the forward-backward algorithm, which recasts the problem as a sequence of matrix multiplications. We also took advantage of an efficient implementation of the RNN-T loss in TensorFlow, which allowed quick iteration on model development and made it possible to train a very deep network.
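For readers unfamiliar with the RNN-T loss, the forward half of that forward-backward computation fits in a few lines. The NumPy sketch below computes the log-likelihood of one label sequence by summing over all alignments; it is a reference illustration, not the TPU-friendly matrix-multiplication formulation used for training.

```python
import numpy as np

def rnnt_log_likelihood(log_probs, labels, blank=0):
    """Forward algorithm for the RNN-T loss (single utterance).

    log_probs: [T, U+1, V] log-probabilities from the joint network, where T is
    the number of acoustic frames, U the number of target labels, and V the
    vocabulary size. Returns log P(labels | audio), summed over all alignments.
    """
    T, U_plus_1, _ = log_probs.shape
    U = U_plus_1 - 1
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrive by emitting blank (advance one frame)
                alpha[t, u] = np.logaddexp(
                    alpha[t, u], alpha[t - 1, u] + log_probs[t - 1, u, blank])
            if u > 0:  # arrive by emitting the next label (same frame)
                alpha[t, u] = np.logaddexp(
                    alpha[t, u], alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]])
    # Finish with a final blank from the last frame, all labels consumed.
    return alpha[T - 1, U] + log_probs[T - 1, U, blank]

# Tiny smoke test with uniform probabilities.
T, U, V = 5, 3, 8
uniform = np.log(np.full((T, U + 1, V), 1.0 / V))
print(rnnt_log_likelihood(uniform, labels=[3, 1, 4]))
```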

The integrated model can be trained just like a speech recognition system. The reference transcripts for training contain the words spoken by a speaker followed by a tag that defines the role of the speaker. For example, “When is the homework due?” ≺student≻, “I expect you to turn them in tomorrow before class,” ≺teacher≻. Once the model is trained with examples of audio and corresponding reference transcripts, a user can feed in a recording of a conversation and expect to see output in a similar form. Our analyses show that improvements from the RNN-T system impact all categories of errors, including short speaker turns, splitting at word boundaries, incorrect speaker assignment in the presence of overlapping speech, and poor audio quality. Moreover, the RNN-T system exhibited consistent performance across conversations, with substantially lower variance in average error rate per conversation compared to the conventional system.
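To make the reference-transcript format above concrete, here is a tiny sketch of how a training target with speaker-role tags might be assembled; the role tags are simply extra symbols in the output vocabulary, and the exact serialization used in the paper may differ.

```python
def build_target(turns):
    """Serialize (role, utterance) pairs into one target word sequence,
    appending a speaker-role tag after each turn."""
    target = []
    for role, utterance in turns:
        target.extend(utterance.split())
        target.append(f"<{role}>")
    return target

turns = [("student", "When is the homework due?"),
         ("teacher", "I expect you to turn them in tomorrow before class.")]
print(build_target(turns))
# ['When', 'is', 'the', 'homework', 'due?', '<student>', 'I', 'expect', ...]
```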

A comparison of errors committed by the conventional system vs. the RNN-T system, as categorized by human annotators.
Furthermore, this integrated model can predict other labels necessary for generating more reader-friendly ASR transcripts. For example, we have been able to successfully improve our transcripts with punctuation and capitalization symbols using the appropriately matched training data. Our outputs have lower punctuation and capitalization errors than our previous models that were separately trained and added as a post-processing step after ASR.

This model has now become a standard component in our project on understanding medical conversations and is also being adopted more widely in our non-medical speech services.

Acknowledgements
We would like to thank Hagen Soltau without whose contributions this work would not have been possible. This work was performed in collaboration with Google Brain and Speech teams.

Source: Google AI Blog


Beta Channel Update for Chrome OS

The Beta channel has been updated to 77.0.3865.35 (Platform version: 12371.22.0) for most Chrome OS devices. This build contains a number of bug fixes, security updates and feature enhancements.

If you find new issues, please let us know by visiting our forum or filing a bug. Interested in switching channels? Find out how. You can submit feedback using ‘Report an issue...’ in the Chrome menu (3 vertical dots in the upper right corner of the browser).

Daniel Gagnon
Google Chrome

Using your digital superpowers for good

It’s National Nonprofit Day! To mark the occasion, we’re excited to have a post from Jeff Hilimire, founder of 48in48, one of our nonprofit partners dedicated to helping other nonprofits make the most of their online presence. Jeff is also CEO of Atlanta-based Dragon Army.



“How do I find a way for my team members to use their skills in digital marketing to help nonprofits in Atlanta?” 


That was the question I kept asking myself as my first digital agency, Spunlogic, grew to almost 100 employees in the mid-2000s. In those days, we would volunteer every quarter at local nonprofits like soup kitchens, homeless shelters, and food drives — but we weren’t giving back with our greatest strengths, our digital skills, which allowed us to help our paying clients build brands and connect with their customers. 


I wrestled with this question for almost a decade, understanding that in today’s landscape the ability for a nonprofit to connect with donors, volunteers, and team members through digital channels would be paramount to their success. 


Then it hit me: “What if I put on a hackathon that brought together 150 volunteers to build as many nonprofit websites as possible in a single weekend?” 


The idea for 48in48 was born. In 2015, I asked my good friend, Adam Walker, to co-found this new organization with me and we hosted our first event later that year in Atlanta. The idea was to build 48 new websites in 48 hours, pairing great digital talent with nonprofits doing essential work in our communities. That event was so well received that we decided to host a spring event in New York, followed by our second Atlanta event later in the fall of 2016. When both of those events went exceptionally well, we knew we were on to something.


In 2017, we hosted events in Atlanta, New York, Boston, and Minneapolis. And then in 2018, we put on events in six cities: Atlanta, New York, Boston, Chapel Hill (NC), Bloomington (IL), and our first international event in London!


Today, we’ve organized 14 events, helped more than 650 nonprofits, and registered 2,000+ volunteers — and we’re just getting started! 


We haven’t done this alone. Google Fiber has been a sponsor since our first year, and we’ve worked with them in a number of capacities. From serving on our board to bringing their talents to help our nonprofit clients optimize their online presence, our relationship with Google Fiber has allowed us to increase our impact. We’ve used key partnerships like this with other brands too, like Delta Air Lines and State Farm, to help us continue to scale. Without their support, we wouldn’t have been able to dream so big.


"We had a record number of users come to our application this year, which we credit to the ease of access and information on our website. Our online presence finally matches who we are as an organization – forward thinking, efficient, sharp, and in constant pursuit of
excellence. This shift moves us onto a new level for how we talk to the public, and our donors love it! We owe so much of that to the team at 48in48 for giving us an incredible website template,” said Jeannette Rankin, founder of the Women’s Scholarship Fund.


Our goal is to create a service opportunity where 10,000 marketing and technology volunteers can donate their skills for good on an annual basis. We’d love your help in this mission! Find out how you can get involved today at 48in48.org.


At an Atlanta 48in48 hackathon, volunteers work with nonprofit leaders to develop the right digital approach. 


Posted by Guest Blogger Jeff Hilimire, founder, 48in48, and CEO of Dragon Army.


Improving Accessibility in the Android Ecosystem

Posted by Ian Stoba, Program Manager, Accessibility Engineering

With billions of Android devices in use around the world and millions of apps available on the Play Store, it might seem difficult to drive change across the entire ecosystem, but the Accessibility Developer Infrastructure team is doing just that.

Every time a developer uploads an APK or app bundle to the open or closed tracks, Play tests this upload on various device models running different versions of Android and generates a pre-launch report to inform the developer of issues.

One year ago, the team added accessibility suggestions to the report, based on industry best practices and Google’s own experience. These tests check for common issues that can make an app harder for people with disabilities to use. For example, they check that buttons are large enough for people to press comfortably, and that text has enough contrast with the background to be easy to read.
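As an illustration of the kind of check involved, the sketch below implements the standard WCAG contrast-ratio formula and the commonly recommended 48x48dp minimum touch-target size; it is not the actual test code behind the pre-launch report.

```python
def relative_luminance(rgb):
    """WCAG 2.0 relative luminance for an sRGB color given as 0-255 integers."""
    def channel(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """WCAG contrast ratio between a foreground and a background color."""
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

def touch_target_ok(width_dp, height_dp, minimum_dp=48):
    """Android guidance recommends touch targets of at least 48x48dp."""
    return width_dp >= minimum_dp and height_dp >= minimum_dp

# Light gray text on white fails the common 4.5:1 threshold for body text.
print(round(contrast_ratio((170, 170, 170), (255, 255, 255)), 2))  # ~2.32
print(touch_target_ok(40, 40))  # False: too small to press comfortably
```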

Since these suggestions launched in July 2018, more than 3.8 million apps have been tested and over 171 million suggestions have been made to improve accessibility. Along with each suggestion, the developer gets detailed information about how to implement it. Every developer, from a one-person startup to a large enterprise, can benefit from the accessibility suggestions in the pre-launch report.

We are already seeing the real-world impact of these efforts. This year at Google I/O, the number of developers signing up for in-person accessibility consultations was four times the number from 2018. Googlers staffing these sessions reported that the developers had specific questions that were often based on the suggestions from the pre-launch report. The focused questions allowed the Googlers to give more actionable recommendations. These developers found that improving accessibility isn't just the right thing to do, it also makes good business sense by increasing the potential market for their apps.

Accessibility tests in the pre-launch report are just one way Google is raising awareness about accessibility in the global developer community. We partnered with Udacity to create a free online course about web accessibility, released our Accessibility Scanner for Android on the Play Store, and published iOS Accessibility Scanner on GitHub, allowing iOS developers to easily instrument their apps for accessibility tests. Together, these efforts support Google's mission to organize the world's information and make it universally accessible and useful.

Learn more about developing with accessibility in mind by visiting the Android Developer Guidelines and the Google Developer Documentation Style Guide.

Create shortcuts in Drive with a new beta

What’s changing 

We’re launching a new beta that allows you to create shortcuts in Drive, making it easy to reference and organize files and folders outside of a given shared drive.

To learn more and express an interest in this beta, see here. We’ll begin accepting domains into this program in the coming weeks.


Who’s impacted 

Admins and end users


Why you’d use it 

Shortcuts are pointers to files that are stored in another folder or in another drive—like a shared drive or another user’s drive—that make it easy to surface content without creating copies of files.

For example, if Paul in marketing shares a document from his team’s shared drive with the entire sales team, Greta in sales can create a shortcut to that document in her own team’s shared drive. Previously, because documents can’t be owned by two shared drives, Greta would need to create a copy of the document for her team’s shared drive, which could then quickly become out of date. 
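For teams that manage Drive content programmatically, shortcuts can also be created through the Drive API once the feature is available to your domain. The sketch below assumes the Drive v3 Python client, an already-authorized credentials object, and that shortcuts are exposed via the application/vnd.google-apps.shortcut MIME type with a shortcutDetails.targetId field.

```python
from googleapiclient.discovery import build

def create_shortcut(creds, target_file_id, parent_folder_id, name):
    """Create a Drive shortcut pointing at an existing file, inside a folder."""
    drive = build("drive", "v3", credentials=creds)
    shortcut_metadata = {
        "name": name,
        "mimeType": "application/vnd.google-apps.shortcut",
        "shortcutDetails": {"targetId": target_file_id},
        "parents": [parent_folder_id],
    }
    # supportsAllDrives lets the shortcut live in a shared drive.
    return drive.files().create(
        body=shortcut_metadata,
        fields="id, shortcutDetails",
        supportsAllDrives=True,
    ).execute()
```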



Additionally, the existing “Add to My Drive” option will be replaced with “Add shortcut to Drive”. Note that files currently living in two locations in My Drive will continue to do so at this time (e.g. those that you’ve added to your My Drive previously).

How to get started 


  • Admins: Admins can express interest in the Google Drive shortcuts beta here. We’ll begin accepting domains into the program in the coming weeks. 
  • End users: Once this feature is enabled for your domain, to create a shortcut: 
    • In Docs, Sheets, and Slides files, you’ll see a new “Add a shortcut to this file in Drive” button next to the “Star” button at the top. From there, you can select where in your Drive you want the shortcut to appear. 
    • From Google Drive, you can right-click on a file and select “Add shortcut to Drive,” or drag and drop an item into a folder in My Drive. 

Additional details 

You can create a shortcut for the following content types:

  • Google Docs, Google Slides, and Google Sheets files 
  • JPGs, PDFs, and Microsoft Office files 
  • Folders 

Shortcuts are visible to everyone who has access to the folder or drive containing the shortcut. Note that creating a shortcut does not mean sharing access to a file or folder.


Availability 

G Suite editions 

  • Available to all G Suite editions



Updates to our manual Content ID claiming policies

In Susan’s April Creator Letter, she shared that improving creators’ experience with copyright claims is one of our top priorities. One concerning trend we’ve seen is aggressive manual claiming of very short music clips used in monetized videos. These claims can feel particularly unfair, as they transfer all revenue from the creator to the claimant, regardless of the amount of music claimed. A little over a month ago, we took a first step in addressing this by requiring copyright owners to provide timestamps for all manual claims so you know exactly which part of your video is being claimed. We also made updates to our editing tools in Creator Studio that allow you to use those timestamps to remove manually claimed content from your videos, automatically releasing the claim and restoring monetization.



Today, we’re announcing additional changes to our manual claiming policies intended to improve fairness in the creator ecosystem, while still respecting copyright owners’ rights to prevent unlicensed use of their content.



Including someone else’s content without permission — regardless of how short the clip is — means your video can still be claimed and copyright owners will still be able to prevent monetization or block the video from being viewed. However, going forward, our policies will forbid copyright owners from using our Manual Claiming tool to monetize creator videos with very short or unintentional uses of music. This change only impacts claims made with the Manual Claiming tool, where the rightsholder is actively reviewing the video. Claims created by the Content ID match system, which are the vast majority, are not impacted by this policy. Without the option to monetize, some copyright owners may choose to leave very short or unintentional uses unclaimed. Others may choose to prevent monetization of the video by any party. And some may choose to apply a block policy.



As always, the best way to avoid these issues is to not use unlicensed content in your videos, even when it’s unintentional music in the background (e.g., vlogging in a store where music is playing). Instead, choose content from trusted sources such as the YouTube Audio Library, which has new tracks added every month. If you do find yourself with an unintended claim, you can use our editing tools to remove the claimed content and the restrictions that come with it. And, of course, if you feel that your use qualifies for an exception to copyright, like Fair Use, be sure you understand what that means and how our dispute process works before uploading your video.



Our enforcement of these new policies will apply to all new manual claims beginning in mid-September, providing adequate time for copyright owners to adapt. Once we start enforcement, copyright owners who repeatedly fail to adhere to these policies will have their access to Manual Claiming suspended.



We strive to make YouTube a fair ecosystem for everyone, including songwriters, artists, and YouTube creators. We acknowledge that these changes may result in more blocked content in the near-term, but we feel this is an important step toward striking the right balance over the long-term. Our goal is to unlock new value for everyone by powering creative reuse and content mashups, while fairly compensating all rightsholders.



— The YouTube Team

What’s new in Chrome OS: better audio, camera and notifications

Every Chromebook runs on Chrome OS, which updates every six weeks to keep your device speedy, smart and secure. Each Chrome OS update happens in the background, without interrupting what you’re doing. Here’s some of what’s new on Chromebook this August.

Control your media in one place

New media controls make it easier for you to pause or play sound from a tab or an app. Have you ever had dozens of tabs and apps open and struggled to turn off a specific tab’s audio? If so, we think you’ll find this change helpful—especially for those moments when you start watching a YouTube video and you want to quickly pause your music.

Starting this month, you can open your system menu and see all of the tabs or apps on your Chromebook that are playing audio tracks and control them from one place.


Take great photos on your Chromebook

The Chromebook camera app has been updated to make taking photos and videos easier. Portrait mode is now available on Google Pixel Slate and we are working on bringing it to other Chromebooks. We’ve introduced an updated interface for navigating between new modes, like square mode and portrait mode.

Now, open your camera app, take a selfie with a landscape or square crop, and access it easily in your Downloads folder.


Clear your notifications faster

With Chrome OS, you can access all your favorite apps from the Google Play Store. In response to your feedback, it’s now easier for you to check and clear notifications from Play Store apps on your Chromebook. Starting this month, easily dismiss your notifications with the “Clear all” button.


We’ll be back in around six weeks to share more of what’s new in Chrome OS. 

Explore college opportunities with new Search features

Summer is winding down, and students across the country are heading back to the classroom. For many students in high school, it’s time to think about their next steps after graduation. While some students may have a certain school or cost considerations in mind, many others may not know where to start or what options are available to them.


The college search feature we launched last year helps students get quick access to information about four-year U.S. universities, including acceptance rates, costs and student outcomes like graduation rates. As this year’s college search season kicks off, we’re expanding our college search features to include two-year colleges and popular certificate and associate programs available at four-year institutions. A new list feature makes it easier to discover a wide range of schools and explore different fields of study.

Considering 2-year colleges

When you use your mobile device to search for any two-year college in the U.S., you’ll get information about the programs offered, cost of attendance and more. Because community college students often stay close to home while enrolled, we show in-state tuition, as well as total cost with books and housing, to give a better view into what you’ll pay depending on your individual circumstances.


A new take on college lists 

If you’re still narrowing your options, our new exploration tool—available on both mobile and desktop—lets you explore a range of schools based on factors like fields of study or geography. Search for something like “hotel management schools in Georgia” and click “more” to jump into the list.



This feature makes it easy to compare costs, graduation rates, campus life and other characteristics to find the college that best fits your needs. You can also filter by specific location or distance, region, size and acceptance rates.

These features use public information from the U.S. Department of Education’s College Scorecard and Integrated Postsecondary Education Data System (comprehensive datasets available for U.S. colleges and universities). We’ve worked with education researchers, experts, nonprofit organizations, high school counselors, and admissions professionals to build these features to meet the needs of students.

These features are available today in the U.S., and we’ll continue to find new ways to make information easily available and helpful as you search for future education opportunities.


Admins can now see and edit user recovery information

What’s changing 

G Suite admins can now view and edit their users’ recovery information, such as backup email addresses and linked phone numbers. We also use this information to verify login requests and increase account security. By making sure your users have accurate and up-to-date information, you can help make their accounts more secure.

Who’s impacted 

Admins only.

Why you’d use it 

This feature was developed based on customer feedback. Security and recovery information is important for many account verification processes, such as login challenge. To learn more about how adding recovery information can significantly increase the security of your account, see this blog post.

Giving admins the ability to view and edit this information will help ensure that more accounts have up-to-date recovery information and increase the accuracy of the recovery information attached to G Suite accounts. This will help:

  • Make it easier for users to access their account if locked out. 
  • Increase challenges and identification of suspicious login attempts, helping to keep malicious actors out. 
  • Enable admins to provide direct support to users who are locked out of their account. 


You can still add employee ID as a login challenge for extra security as well.

How to get started 


  • Admins: There are three ways admins can currently manage recovery information: 
    • Individual user accounts: Go to Admin Console > Users > Individual User > Security > Recovery information > Edit. You’ll be able to edit individual user recovery information directly. 
    • Bulk user upload tool (CSV): Use the bulk upload tool at Admin Console > Users to update in bulk. See the edit accounts with a spreadsheet section of this Help Center article for details. 
    • API: Use the Admin SDK Directory API (see the sketch after this list). 
  • End users: No action needed, but users can add recovery information by going to myaccount.google.com. 
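For the API route above, here is a minimal sketch using the Directory API’s users.update method. It assumes the google-api-python-client library, admin credentials authorized for the https://www.googleapis.com/auth/admin.directory.user scope, and that the User resource exposes recoveryEmail and recoveryPhone fields.

```python
from googleapiclient.discovery import build

def set_recovery_info(creds, user_email, recovery_email, recovery_phone):
    """Update a user's recovery email and phone via the Admin SDK Directory API."""
    directory = build("admin", "directory_v1", credentials=creds)
    body = {
        "recoveryEmail": recovery_email,
        "recoveryPhone": recovery_phone,  # E.164 format, e.g. "+14155550123"
    }
    return directory.users().update(userKey=user_email, body=body).execute()
```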




Availability 


G Suite editions 
Available to all G Suite editions.

On/off by default? 
This feature will be ON by default.
