Posted by Maithra Raghu, Google Brain Team and Ari S. Morcos, DeepMind
In order to solve tasks, deep neural networks (DNNs) progressively transform input data into a sequence of complex representations (i.e., patterns of activations across individual neurons). Understanding these representations is critically important, not only for interpretability, but also so that we can more intelligently design machine learning systems. However, understanding these representations has proven quite difficult, especially when comparing representations across networks. In a previous post, we outlined the benefits of Canonical Correlation Analysis (CCA) as a tool for understanding and comparing the representations of convolutional neural networks (CNNs), showing that they converge in a bottom-up pattern, with early layers converging to their final representations before later layers over the course of training.
In “Insights on Representational Similarity in Neural Networks with Canonical Correlation” we develop this work further to provide new insights into the representational similarity of CNNs, including differences between networks that memorize (e.g., networks that can only classify images they have seen before) and those that generalize (e.g., networks that can correctly classify previously unseen images). Importantly, we also extend this method to provide insights into the dynamics of recurrent neural networks (RNNs), a class of models that are particularly useful for sequential data, such as language. Comparing RNNs is difficult in many of the same ways as CNNs, but RNNs present the additional challenge that their representations change over the course of a sequence. This makes CCA, with its helpful invariances, an ideal tool for studying RNNs in addition to CNNs. As such, we have additionally open sourced the code used for applying CCA on neural networks with the hope that it will help the research community better understand network dynamics.
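To make the comparison concrete, here is a minimal sketch (not the weighted CCA variant described in the paper, and not our open-sourced code) of how one might compute a CCA-based distance between the activations of the same layer from two different networks, using scikit-learn and an unweighted mean of the canonical correlations. The activation matrices and component count below are illustrative stand-ins.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_distance(acts_a, acts_b, n_components=20):
    """Rough CCA-based distance between two activation matrices of shape
    (num_datapoints, num_neurons); returns 1 - mean canonical correlation."""
    cca = CCA(n_components=n_components, max_iter=1000)
    za, zb = cca.fit_transform(acts_a, acts_b)
    # Correlation of each pair of canonical variates.
    corrs = [np.corrcoef(za[:, i], zb[:, i])[0, 1] for i in range(n_components)]
    return 1.0 - float(np.mean(corrs))

# Example: compare one layer of two independently trained networks,
# evaluated on the same probe inputs (random stand-ins here).
rng = np.random.default_rng(0)
layer_net1 = rng.normal(size=(1000, 64))
layer_net2 = rng.normal(size=(1000, 64))
print(cca_distance(layer_net1, layer_net2))
```

Because the score depends only on correlations between linear combinations of neurons, it is invariant to permutations and invertible linear transforms of each layer's neurons, which is what makes it suitable for comparing independently trained networks.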
Representational Similarity of Memorizing and Generalizing CNNs Ultimately, a machine learning system is only useful if it can generalize to new situations it has never seen before. Understanding the factors which differentiate between networks that generalize and those that don’t is therefore essential, and may lead to new methods to improve generalization performance. To investigate whether representational similarity is predictive of generalization, we studied two types of CNNs:
generalizing networks: CNNs trained on data with unmodified, accurate labels and which learn solutions which generalize to novel data.
memorizing networks: CNNs trained on datasets with randomized labels such that they must memorize the training data and cannot, by definition, generalize (as in Zhang et al., 2017).
We trained multiple instances of each network, differing only in the initial randomized values of the network weights and the order of the training data, and used a new weighted approach to calculate the CCA distance measure (see our paper for details) to compare the representations within each group of networks and between memorizing and generalizing networks.
We found that groups of different generalizing networks consistently converged to more similar representations (especially in later layers) than groups of memorizing networks (see figure below). At the softmax, which denotes the network’s ultimate prediction, the CCA distance for each group of generalizing and memorizing networks decreases substantially, as the networks in each separate group make similar predictions.
Groups of generalizing networks (blue) converge to more similar solutions than groups of memorizing networks (red). CCA distance was calculated between groups of networks trained on real CIFAR-10 labels (“Generalizing”) or randomized CIFAR-10 labels (“Memorizing”) and between pairs of memorizing and generalizing networks (“Inter”).
Perhaps most surprisingly, in later hidden layers, the representational distance between any given pair of memorizing networks was about the same as the representational distance between a memorizing and generalizing network (“Inter” in the plot above), despite the fact that these networks were trained on data with entirely different labels. Intuitively, this result suggests that while there are many different ways to memorize the training data (resulting in greater CCA distances), there are fewer ways to learn generalizable solutions. In future work, we plan to explore whether this insight can be used to regularize networks to learn more generalizable solutions.
Understanding the Training Dynamics of Recurrent Neural Networks So far, we have only applied CCA to CNNs trained on image data. However, CCA can also be applied to calculate representational similarity in RNNs, both over the course of training and over the course of a sequence. Applying CCA to RNNs, we first asked whether the RNNs exhibit the same bottom-up convergence pattern we observed in our previous work for CNNs. To test this, we measured the CCA distance between the representation at each layer of the RNN over the course of training with its final representation at the end of training. We found that the CCA distance for layers closer to the input dropped earlier in training than for deeper layers, demonstrating that, like CNNs, RNNs also converge in a bottom-up pattern (see figure below).
Convergence dynamics for RNNs over the course of training exhibit bottom-up convergence: layers closer to the input converge to their final representations earlier in training than later layers. For example, layer 1 converges to its final representation earlier than layer 2, which converges earlier than layer 3, and so on. Epoch designates the number of times the model has seen the entire training set, while different colors represent the convergence dynamics of different layers.
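As a rough illustration of this measurement (reusing the cca_distance sketch above), tracking convergence amounts to comparing each training checkpoint's layer activations against the final model's; get_layer_activations is a hypothetical helper, assumed to return a (num_datapoints, num_neurons) activation matrix for a named layer on fixed probe inputs.

```python
# Sketch: for each saved checkpoint, compute the CCA distance between every
# layer's current representation and that layer's final representation.
# `get_layer_activations` is a hypothetical helper (not part of any library).
def convergence_curves(checkpoint_models, final_model, probe_inputs, layer_names):
    curves = {name: [] for name in layer_names}
    for model in checkpoint_models:  # models saved at increasing epochs
        for name in layer_names:
            acts_now = get_layer_activations(model, name, probe_inputs)
            acts_end = get_layer_activations(final_model, name, probe_inputs)
            curves[name].append(cca_distance(acts_now, acts_end))
    return curves  # curves for layers nearer the input should drop toward 0 sooner
```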
Additional findings in our paper show that wider networks (e.g., networks with more neurons at each layer) converge to more similar solutions than narrow networks. We also found that trained networks with identical structures but different learning rates converge to distinct clusters with similar performance, but highly dissimilar representations. We also apply CCA to RNN dynamics over the course of a single sequence, rather than simply over the course of training, providing some initial insights into the various factors which influence RNN representations over time.
Conclusions These findings reinforce the utility of analyzing and comparing DNN representations in order to provide insights into network function, generalization, and convergence. However, there are still many open questions: in future work, we hope to uncover which aspects of the representation are conserved across networks, both in CNNs and RNNs, and whether these insights can be used to improve network performance. We encourage others to try out the code used for the paper to investigate what CCA can tell us about other neural networks!
Acknowledgements Special thanks to Samy Bengio, who is a co-author on this work. We also thank Martin Wattenberg, Jascha Sohl-Dickstein and Jon Kleinberg for helpful comments.
Posted by Sai Deep Tetali, Software Engineer, Google Play Protect
At Google I/O 2017, we introduced Google Play Protect, our comprehensive set of security services for Android. While the name is new, the smarts powering Play Protect have protected Android users for years.
Google Play Protect's suite of mobile threat protections is built into more than 2 billion Android devices, automatically taking action in the background. We're constantly updating these protections so you don't have to think about security: it just happens. Our protections have been made even smarter by adding machine learning elements to Google Play Protect.
Security at scale
Google Play Protect provides in-the-moment protection from potentially harmful apps (PHAs), but Google's protections start earlier.
Before they're published in Google Play, all apps are rigorously analyzed by our security systems and Android security experts. Thanks to this process, Android devices that only download apps from Google Play are 9 times less likely to get a PHA than devices that download apps from other sources.
After you install an app, Google Play Protect continues its quest to keep your device safe by regularly scanning your device to make sure all apps are behaving properly. If it finds an app that is misbehaving, Google Play Protect either notifies you, or simply removes the harmful app to keep your device safe.
Our systems scan over 50 billion apps every day. To keep on the cutting edge of security, we look for new risks in a variety of ways, such as identifying specific code paths that signify bad behavior, investigating behavior patterns to correlate bad apps, and reviewing possible PHAs with our security experts.
In 2016, we added machine learning as a new detection mechanism and it soon became a critical part of our systems and tools.
Training our machines
In the most basic terms, machine learning means training a computer algorithm to recognize a behavior. To train the algorithm, we give it hundreds of thousands of examples of that behavior.
In the case of Google Play Protect, we are developing algorithms that learn which apps are "potentially harmful" and which are "safe." To learn about PHAs, the machine learning algorithms analyze our entire catalog of applications. Then our algorithms look at hundreds of signals combined with anonymized data to compare app behavior across the Android ecosystem to find PHAs. They look for behavior common to PHAs, such as apps that attempt to interact with other apps on the device, access or share your personal data, download something without your knowledge, connect to phishing websites, or bypass built-in security features.
When we find apps that exhibit similar malicious behavior, we group them into families. Visualizing these PHA families helps us uncover apps that share similarities with known bad apps but have so far stayed under our radar.
After we identify a new PHA, we confirm our findings with expert security reviews. If the app in question is a PHA, Google Play Protect takes action on the app and then we feed information about that PHA back into our algorithms to help find more PHAs.
Doubling down on security
So far, our machine learning systems have successfully detected 60.3% of the malware identified by Google Play Protect in 2017.
In 2018, we're devoting a massive amount of computing power and talent to create, maintain and improve these machine learning algorithms. We're constantly leveraging artificial intelligence and our highly skilled researchers and engineers from all across Google to find new ways to keep Android devices safe and secure. In addition to our talented team, we work with the foremost security experts and researchers from around the world. These researchers contribute even more data and insights to keep Google Play Protect on the cutting edge of mobile security.
To check out Google Play Protect, open the Google Play app and tap Play Protect in the left panel.
Acknowledgements: This work was developed in joint collaboration with Google Play Protect, Safe Browsing and Play Abuse teams with contributions from Andrew Ahn, Hrishikesh Aradhye, Daniel Bali, Hongji Bao, Yajie Hu, Arthur Kaiser, Elena Kovakina, Salvador Mandujano, Melinda Miller, Rahul Mishra, Damien Octeau, Sebastian Porst, Chuangang Ren, Monirul Sharif, Sri Somanchi, Sai Deep Tetali, Zhikun Wang, and Mo Yu.
Posted by Przemek Pardel, Developer Relations Program Manager, Regional Lead
This summer, the Google Developers team is touring 10 countries and 14 cities in Europe in a colorful community bus. We'll be visiting university campuses and technology parks to meet you locally and talk about our programs for developers and start-ups.
Join us to find out how Google supports developer communities. Learn about Google Developer Groups, the Women Techmakers program and the various ways we engage with the broader developer community in Europe and around the world.
Our bus will stop in the following locations between 12:00 and 4:00 pm:
4th June, Estonia, Tallinn
6th June, Latvia, Riga
8th June, Lithuania, Vilnius
11th June, Poland, Gdańsk
13th June, Poland, Poznań
15th June, Poland, Kraków
18th June, Slovenia, Ljubljana
19th June, Croatia, Zagreb
21st June, Bulgaria, Sofia
Want to meet us on the way? Sign up for the event in your city here.
What to expect:
Information: learn more about how Google supports developer communities around the world, from content and speakers to a global network
Network: with other community organizers from your city
Workshops: join some of our product workshops on tour (Actions on Google, Google Cloud, Machine Learning), and meet with Google teams
Fun: live music, games and more!
Are you interested in starting a new developer community or are you an organizer who would like to join the global Google Community Program? Let us know and receive an invitation-only pass to our private events.
Ever since I can remember, music has been a huge part of who I am. Growing up, my parents formed a traditional Mexican trio band and their music filled the rooms of my childhood home. I’ve always felt deeply moved by music, and I’m fascinated by the emotions music brings out in people.
When I attended community college and took my first physics course, I was introduced to the science of music—how it’s a complex assembly of overlapping sound waves that we sense from the resulting vibrations created in our eardrums. Though my parents had always taken an artistic approach to playing with soundwaves, I took a scientific one. Studying acoustics opened up all kinds of doors I never thought possible, from pursuing a career in electrical engineering—to studying whale calls using machine learning.
Daniel with his family during move in day for his first quarter at Cal Poly.
I applied to the Monterey Bay Aquarium Research Institute (MBARI) summer internship program, where I learned about John Ryan and Danelle Cline’s research using machine learning (ML) to monitor whale sounds. Once again, I found myself fascinated by sound, this time by analyzing the sounds of endangered blue and fin whales to further understand their ecology. By identifying and tracking the whales’ calls and changing migration patterns, scientists hope to gain insight on the broader impacts of climate change on ocean ecology, and how human influence negatively impacts marine life.
MBARI had already collected thousands of hours of audio, but it would have been far too cumbersome to sift through all of that data by hand to find whale calls. That’s what led Danelle to introduce me to machine learning. ML enables us to pick out patterns from very large data sets like MBARI’s audio recordings. By training the model using TensorFlow, we can efficiently sift through the data and track these whales with 98 percent accuracy. This tracking system can tell us how many calls were made in any given amount of time near Monterey Bay, and it will enable scientists at MBARI to track the whales’ changing migration behavior and advance their research on whale ecology and how human influence above water negatively impacts marine life below.
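The MBARI model itself isn’t described here, but as a loose sketch of the general approach, one could train a small TensorFlow classifier on spectrogram windows of the hydrophone audio to flag likely whale calls. The input shape, architecture and labels below are purely illustrative assumptions, not the actual system.

```python
import tensorflow as tf

# Illustrative sketch only (not MBARI's model): classify fixed-size spectrogram
# windows of hydrophone audio as "whale call" vs. "background".
def build_call_detector(input_shape=(128, 128, 1)):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=input_shape),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # call vs. no call
    ])

model = build_call_detector()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(spectrogram_windows, labels, epochs=10)  # labeled windows from the archive
```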
What started as a passion for music ended in a love of engineering thanks to the opportunity at MBARI. Before community college I had no idea what an engineer even did, and I certainly never imagined my music background would be relevant in using TensorFlow to identify and classify whale calls within a sea of ocean audio data. But I soon learned there’s more than one way to pursue a passion, and I’m excited for what the future holds—for marine life, for machine learning, and for myself. Following the whales on their journey has led me to begin mine.
Posted by Sujith Ravi, Senior Staff Research Scientist, Google Expander Team
Successful deep learning models often require significant amounts of computational resources, memory and power to train and run, which presents an obstacle if you want them to perform well on mobile and IoT devices. On-device machine learning allows you to run inference directly on the devices, with the benefits of data privacy and access everywhere, regardless of connectivity. On-device ML systems, such as MobileNets and ProjectionNets, address the resource bottlenecks on mobile devices by optimizing for model efficiency. But what if you wanted to train your own customized, on-device models for your personal mobile application?
Yesterday at Google I/O, we announced ML Kit to make machine learning accessible for all mobile developers. One of the core ML Kit capabilities that will be available soon is an automatic model compression service powered by “Learn2Compress” technology developed by our research team. Learn2Compress enables custom on-device deep learning models in TensorFlow Lite that run efficiently on mobile devices, without developers having to worry about optimizing for memory and speed. We are pleased to make Learn2Compress for image classification available soon through ML Kit. Learn2Compress will be initially available to a small number of developers, and will be offered more broadly in the coming months. You can sign up here if you are interested in using this feature for building your own models.
How it Works Learn2Compress generalizes the learning framework introduced in previous works like ProjectionNet and incorporates several state-of-the-art techniques for compressing neural network models. It takes as input a large pre-trained TensorFlow model provided by the user, performs training and optimization and automatically generates ready-to-use on-device models that are smaller in size, more memory-efficient, more power-efficient and faster at inference with minimal loss in accuracy.
Learn2Compress for automatically generating on-device ML models.
To do this, Learn2Compress uses multiple neural network optimization and compression techniques including:
Pruning reduces model size by removing weights or operations that are least useful for predictions (e.g., low-scoring weights). This can be very effective, especially for on-device models involving sparse inputs or outputs, which can be reduced up to 2x in size while retaining 97% of the original prediction quality.
Quantization techniques are particularly effective when applied during training and can improve inference speed by reducing the number of bits used for model weights and activations. For example, using 8-bit fixed point representation instead of floats can speed up the model inference, reduce power and further reduce size by 4x (see the quantization sketch after this list).
Joint training and distillation approaches follow a teacher-student learning strategy — we use a larger teacher network (in this case, user-provided TensorFlow model) to train a compact student network (on-device model) with minimal loss in accuracy.
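As a rough illustration of the size benefit of 8-bit weights, the snippet below applies standard TensorFlow Lite post-training quantization using the TF 2.x converter API; this is not the Learn2Compress service itself, which quantizes during training, and the model choice and file name are illustrative.

```python
import tensorflow as tf

# Illustration only: standard TensorFlow Lite post-training quantization,
# not Learn2Compress (which applies quantization during training).
model = tf.keras.applications.MobileNet(weights=None)  # stand-in for any trained Keras model

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enables 8-bit weight quantization
tflite_bytes = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_bytes)  # roughly 4x smaller than the float32 model
```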
Joint training and distillation approach to learn compact student models.
The teacher network can be fixed (as in distillation) or jointly optimized, and even train multiple student models of different sizes simultaneously. So instead of a single model, Learn2Compress generates multiple on-device models in a single shot, at different sizes and inference speeds, and lets the developer pick one best suited for their application needs.
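To make the teacher-student setup concrete, here is a minimal, generic distillation-loss sketch in TensorFlow; it is not the Learn2Compress implementation, and the temperature and mixing weight are illustrative defaults.

```python
import tensorflow as tf

# Generic distillation loss sketch (not the Learn2Compress implementation):
# the student matches the teacher's softened predictions and the true labels.
def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    soft_student_logp = tf.nn.log_softmax(student_logits / temperature)
    # Cross-entropy against the temperature-softened teacher distribution.
    distill = -tf.reduce_mean(
        tf.reduce_sum(soft_teacher * soft_student_logp, axis=-1)) * temperature ** 2
    # Standard cross-entropy against the ground-truth labels.
    hard = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            labels, student_logits, from_logits=True))
    return alpha * distill + (1.0 - alpha) * hard
```

In a training step you would run a batch through both the teacher (fixed or jointly trained) and the student, then minimize this loss with respect to the student's weights; Learn2Compress additionally trains several student sizes in parallel from the same teacher.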
These and other techniques like transfer learning also make the compression process more efficient and scalable to large-scale datasets.
How well does it work? To demonstrate the effectiveness of Learn2Compress, we used it to build compact on-device models of several state-of-the-art deep networks used in image and natural language tasks, including MobileNets, NASNet, Inception and ProjectionNet. For a given task and dataset, we can generate multiple on-device models at different inference speeds and model sizes.
Accuracy at various sizes for Learn2Compress models and full-sized baseline networks on CIFAR-10 (left) and ImageNet (right) image classification tasks. Student networks used to produce the compressed variants for CIFAR-10 and ImageNet are modeled using NASNet and MobileNet-inspired architectures, respectively.
For image classification, Learn2Compress can generate small and fast models with good prediction accuracy suited for mobile applications. For example, on the ImageNet task, Learn2Compress achieves a model 22x smaller than the Inception v3 baseline and 4x smaller than the MobileNet v1 baseline with just a 4.6-7% drop in accuracy. On CIFAR-10, jointly training multiple Learn2Compress models with shared parameters takes only 10% more time than training a single large Learn2Compress model, but yields 3 compressed models that are up to 94x smaller in size and up to 27x faster, with up to 36x lower cost and good prediction quality (90-95% top-1 accuracy).
Computation cost and average prediction latency (on Pixel phone) for baseline and Learn2Compress models on CIFAR-10 image classification task. Learn2Compress-optimized models use NASNet-style network architecture.
We are also excited to see how well this performs on developer use-cases. For example, Fishbrain, a social platform for fishing enthusiasts, used Learn2Compress to compress their existing image classification cloud model (80MB+ in size and 91.8% top-3 accuracy) to a much smaller on-device model, less than 5MB in size, with similar accuracy. In some cases, we observe that it is possible for the compressed models to even slightly outperform the original large model’s accuracy due to better regularization effects.
We will continue to improve Learn2Compress with future advances in ML and deep learning, and extend it to more use cases beyond image classification. We are excited and looking forward to making this available soon through ML Kit’s compression service on the Cloud. We hope this will make it easy for developers to automatically build and optimize their own on-device ML models so that they can focus on building great apps and cool user experiences involving computer vision, natural language and other machine learning applications.
Acknowledgments I would like to acknowledge our core contributors Gaurav Menghani, Prabhu Kaliamoorthi and Yicheng Fan along with Wei Chai, Kang Lee, Sheng Xu and Pannag Sanketi. Special thanks to Dave Burke, Brahim Elbouchikhi, Hrishikesh Aradhye, Hugues Vincent, and Arun Venkatesan from the Android team; Sachin Kotwani, Wesley Tarle and Pavel Jbanov from the Firebase team; Andrei Broder, Andrew Tomkins, Robin Dua, Patrick McGregor, Gaurav Nemade, the Google Expander team and the TensorFlow team.
Editor’s Note: AI is behind many of Google’s products and is a big priority for us as a company (as you may have heard at Google I/O yesterday). So we’re sharing highlights on how AI already affects your life in ways you might not know, and how people from all over the world have used AI to build their own technology.
Machine learning is at the core of many of Google’s own products, but TensorFlow—our open source machine learning framework—has also been an essential component of the work of scientists, researchers and even high school students around the world. At Google I/O, we’re hearing from some of these people, who are solving big (we mean, big) problems—the origin of the universe, that sort of stuff. Here are some of the interesting ways they’re using TensorFlow to aid their work.
Ari Silburt, a Ph.D. student at Penn State University, wants to uncover the origins of our solar system. In order to do this, he has to map craters in the solar system, which helps him figure out where matter has existed in various places (and at various times) in the solar system. You with us? Historically, this process has been done by hand and is both time consuming and subjective, but Ari and his team turned to TensorFlow to automate it. They’ve trained the machine learning model using existing photos of the moon, and have identified more than 6,000 new craters.
On the left is a picture of the moon; it's hard to tell where the heck those craters are. On the right, we have an accurate depiction of crater distribution, thanks to TensorFlow.
Switching from outer space to the rainforests of Brazil: Topher White (founder of Rainforest Connection) invented “The Guardian” device to prevent illegal deforestation in the Amazon. The devices—which are upcycled cell phones running on TensorFlow—are installed in trees throughout the forest, recognize the sound of chainsaws and logging trucks, and alert the rangers who police the area. Without these devices, the land must be policed by people, which is nearly impossible given the massive area it covers.
Topher installs guardian devices in the tall trees of the Amazon
Diabetic retinopathy (DR) is the fastest growing cause of blindness, with nearly 415 million diabetic patients at risk worldwide. If caught early, the disease can be treated; if not, it can lead to irreversible blindness. In 2016, we announced that machine learning was being used to aid diagnostic efforts in the area of DR, by analyzing a patient’s fundus image (photo of the back of the eye) with higher accuracy. Now we’re taking those fundus images to the next level with TensorFlow. Dr. Jorge Cuadros, an optometrist in Oakland, CA, is able to determine a patient’s risk of cardiovascular disease by analyzing their fundus image with a deep learning model.
Fundus image of an eye with sight-threatening retinal disease. With machine learning this image will tell doctors much more than eye health.
Good news for green thumbs of the world: Shaza Mehdi and Nile Ravenell are high school students who developed PlantMD, an app that lets you figure out if your plant is diseased. The machine learning model runs on TensorFlow, and Shaza and Nile used data from plantvillage.com and a few university databases to train the model to recognize diseased plants. Shaza also built another app that uses a similar approach to diagnose skin disease.
PlantMD in action
To learn more about how AI can bring benefits to everyone, check out ai.google.
In today's fast-moving world, people have come to expect mobile apps to be intelligent - adapting to users' activity or delighting them with surprising smarts. As a result, we think machine learning will become an essential tool in mobile development. That's why on Tuesday at Google I/O, we introduced ML Kit in beta: a new SDK that brings Google's machine learning expertise to mobile developers in a powerful, yet easy-to-use package on Firebase. We couldn't be more excited!
Machine learning for all skill levels
Getting started with machine learning can be difficult for many developers. Typically, new ML developers spend countless hours learning the intricacies of implementing low-level models, using frameworks, and more. Even for the seasoned expert, adapting and optimizing models to run on mobile devices can be a huge undertaking. Beyond the machine learning complexities, sourcing training data can be an expensive and time-consuming process, especially when considering a global audience.
With ML Kit, you can use machine learning to build compelling features, on Android and iOS, regardless of your machine learning expertise. More details below!
Production-ready for common use cases
If you're a beginner who just wants to get the ball rolling, ML Kit gives you five ready-to-use ("base") APIs that address common mobile use cases:
Text recognition
Face detection
Barcode scanning
Image labeling
Landmark recognition
With these base APIs, you simply pass in data to ML Kit and get back an intuitive response. For example: Lose It!, one of our early users, used ML Kit to build several features in the latest version of their calorie tracker app. Using our text recognition base API and a custom-built model, their app can quickly capture nutrition information from product labels to input a food's content from an image.
ML Kit gives you both on-device and Cloud APIs, all in a common and simple interface, allowing you to choose the ones that fit your requirements best. The on-device APIs process data quickly and will work even when there's no network connection, while the cloud-based APIs leverage the power of Google Cloud Platform's machine learning technology to give a higher level of accuracy.
Heads up: We're planning to release two more APIs in the coming months. First is a smart reply API that allows you to support contextual messaging replies in your app, and the second is a high-density face contour addition to the face detection API. Sign up here to give them a try!
Deploy custom models
If you're seasoned in machine learning and you don't find a base API that covers your use case, ML Kit lets you deploy your own TensorFlow Lite models. You simply upload them via the Firebase console, and we'll take care of hosting and serving them to your app's users. This way you can keep your models out of your APK/bundles which reduces your app install size. Also, because ML Kit serves your model dynamically, you can always update your model without having to re-publish your apps.
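As a rough, optional illustration (this is plain TensorFlow, not ML Kit or Firebase API code), before uploading a TensorFlow Lite model through the Firebase console you might sanity-check it locally with the TensorFlow Lite interpreter; the file name below is illustrative.

```python
import numpy as np
import tensorflow as tf

# Sketch: load a converted .tflite model and run one dummy inference locally
# before uploading it through the Firebase console (file name is illustrative).
interpreter = tf.lite.Interpreter(model_path="my_model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed zeros with the expected input shape/dtype and check the output shape.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
print(interpreter.get_tensor(output_details[0]["index"]).shape)
```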
But there is more. As apps have grown to do more, their size has increased, hurting app store install rates and potentially costing users more in data overages. Machine learning can further exacerbate this trend since models can reach tens of megabytes in size. So we decided to invest in model compression. Specifically, we are experimenting with a feature that allows you to upload a full TensorFlow model, along with training data, and receive in return a compressed TensorFlow Lite model. The technology behind this is evolving rapidly and so we are looking for a few developers to try it and give us feedback. If you are interested, please sign up here.
Better together with other Firebase products
Since ML Kit is available through Firebase, it's easy for you to take advantage of the broader Firebase platform. For example, Remote Config and A/B testing let you experiment with multiple custom models. You can dynamically switch values in your app, making it a great fit for swapping the custom models you want your users to use on the fly. You can even create population segments and experiment with several models in parallel.
We can't wait to see what you'll build with ML Kit. We hope you'll love the product like many of our early customers do.
Get started with the ML Kit beta by visiting your Firebase console today. If you have any thoughts or feedback, feel free to let us know - we're always listening!
Today, we’re kicking off our annual I/O developer conference, which brings together more than 7,000 developers for a three-day event. I/O gives us a great chance to share some of Google’s latest innovations and show how they’re helping us solve problems for our users. We’re at an important inflection point in computing, and it’s exciting to be driving technology forward. It’s clear that technology can be a positive force and improve the quality of life for billions of people around the world. But it’s equally clear that we can’t just be wide-eyed about what we create. There are very real and important questions being raised about the impact of technology and the role it will play in our lives. We know the path ahead needs to be navigated carefully and deliberately—and we feel a deep sense of responsibility to get this right. It’s in that spirit that we’re approaching our core mission.
The need for useful and accessible information is as urgent today as it was when Google was founded nearly two decades ago. What’s changed is our ability to organize information and solve complex, real-world problems thanks to advances in AI.
Pushing the boundaries of AI to solve real-world problems
There’s a huge opportunity for AI to transform many fields. Already we’re seeing some encouraging applications in healthcare. Two years ago, Google developed a neural net that could detect signs of diabetic retinopathy using medical images of the eye. This year, the AI team showed our deep learning model could use those same images to predict a patient’s risk of a heart attack or stroke with a surprisingly high degree of accuracy. We published a paper on this research in February and look forward to working closely with the medical community to understand its potential. We’ve also found that our AI models are able to predict medical events, such as hospital readmissions and length of stays, by analyzing the pieces of information embedded in de-identified health records. These are powerful tools in a doctor’s hands and could have a profound impact on health outcomes for patients. We’re going to be publishing a paper on this research today and are working with hospitals and medical institutions to see how to use these insights in practice.
Another area where AI can solve important problems is accessibility. Take the example of captions. When you turn on the TV it's not uncommon to see people talking over one another. This makes a conversation hard to follow, especially if you’re hearing-impaired. But using audio and visual cues together, our researchers were able to isolate voices and caption each speaker separately. We call this technology Looking to Listen and are excited about its potential to improve captions for everyone.
Saving time across Gmail, Photos, and the Google Assistant
AI is working hard across Google products to save you time. One of the best examples of this is the new Smart Compose feature in Gmail. By understanding the context of an email, we can suggest phrases to help you write quickly and efficiently. In Photos, we make it easy to share a photo instantly via smart, inline suggestions. We’re also rolling out new features that let you quickly brighten a photo, give it a color pop, or even colorize old black and white pictures.
One of the biggest time-savers of all is the Google Assistant, which we announced two years ago at I/O. Today we shared our plans to make the Google Assistant more visual, more naturally conversational, and more helpful.
Thanks to our progress in language understanding, you’ll soon be able to have a natural back-and-forth conversation with the Google Assistant without repeating “Hey Google” for each follow-up request. We’re also adding half a dozen new voices to personalize your Google Assistant, plus one very recognizable one—John Legend (!). So, next time you ask Google to tell you the forecast or play “All of Me,” don’t be surprised if John Legend himself is around to help.
We’re also making the Assistant more visually assistive with new experiences for Smart Displays and phones. On mobile, we’ll give you a quick snapshot of your day with suggestions based on location, time of day, and recent interactions. And we’re bringing the Google Assistant to navigation in Google Maps, so you can get information while keeping your hands on the wheel and your eyes on the road.
Someday soon, your Google Assistant might be able to help with tasks that still require a phone call, like booking a haircut or verifying a store’s holiday hours. We call this new technology Google Duplex. It’s still early, and we need to get the experience right, but done correctly we believe this will save time for people and generate value for small businesses.
Smart Compose can understand the context of an email and suggest phrases to help you write quickly and efficiently.
With Google Photos, we’re working on the ability for you to change black-and-white shots into color in just a tap.
With Smart Displays, the Google Assistant is becoming more visual.
Understanding the world so we can help you navigate yours
AI’s progress in understanding the physical world has dramatically improved Google Maps and created new applications like Google Lens. Maps can now tell you if the business you’re looking for is open, how busy it is, and whether parking is easy to find before you arrive. Lens lets you just point your camera and get answers about everything from that building in front of you ... to the concert poster you passed ... to that lamp you liked in the store window.
Bringing you the top news from top sources
We know people turn to Google to provide dependable, high-quality information, especially in breaking news situations—and this is another area where AI can make a big difference. Using the latest technology, we set out to create a product that surfaces the news you care about from trusted sources while still giving you a full range of perspectives on events. Today, we’re launching the new Google News. It uses artificial intelligence to bring forward the best of human intelligence—great reporting done by journalists around the globe—and will help you stay on top of what’s important to you.
The new Google News uses AI to bring forward great reporting done by journalists around the globe and help you stay on top of what’s important to you.
Helping you focus on what matters
Advances in computing are helping us solve complex problems and deliver valuable time back to our users—which has been a big goal of ours from the beginning. But we also know technology creates its own challenges. For example, many of us feel tethered to our phones and worry about what we’ll miss if we’re not connected. We want to help people find the right balance and gain a sense of digital wellbeing. To that end, we’re going to release a series of features to help people understand their usage habits and use simple cues to disconnect when they want to, such as turning a phone over on a table to put it in “shush” mode, or “taking a break” from watching YouTube when a reminder pops up. We're also kicking off a longer-term effort to support digital wellbeing, including a user education site which is launching today.
Posted by Yaniv Leviathan, Principal Engineer and Yossi Matias, Vice President, Engineering, Google
A long-standing goal of human-computer interaction has been to enable people to have a natural conversation with computers, as they would with each other. In recent years, we have witnessed a revolution in the ability of computers to understand and to generate natural speech, especially with the application of deep neural networks (e.g., Google voice search, WaveNet). Still, even with today’s state of the art systems, it is often frustrating having to talk to stilted computerized voices that don't understand natural language. In particular, automated phone systems are still struggling to recognize simple words and commands. They don’t engage in a conversation flow and force the caller to adjust to the system instead of the system adjusting to the caller.
Today we announce Google Duplex, a new technology for conducting natural conversations to carry out “real world” tasks over the phone. The technology is directed towards completing specific tasks, such as scheduling certain types of appointments. For such tasks, the system makes the conversational experience as natural as possible, allowing people to speak normally, like they would to another person, without having to adapt to a machine.
One of the key research insights was to constrain Duplex to closed domains, which are narrow enough to explore extensively. Duplex can only carry out natural conversations after being deeply trained in such domains. It cannot carry out general conversations.
Here are examples of Duplex making phone calls (using different voices):
Duplex scheduling a hair salon appointment:
Duplex calling a restaurant:
While sounding natural, these and other examples are conversations between a fully automatic computer system and real businesses.
The Google Duplex technology is built to sound natural, to make the conversation experience comfortable. It’s important to us that users and businesses have a good experience with this service, and transparency is a key part of that. We want to be clear about the intent of the call so businesses understand the context. We’ll be experimenting with the right approach over the coming months.
Conducting Natural Conversations There are several challenges in conducting natural conversations: natural language is hard to understand, natural behavior is tricky to model, latency expectations require fast processing, and generating natural sounding speech, with the appropriate intonations, is difficult.
When people talk to each other, they use more complex sentences than when talking to computers. They often correct themselves mid-sentence, are more verbose than necessary, or omit words and rely on context instead; they also express a wide range of intents, sometimes in the same sentence, e.g., “So umm Tuesday through Thursday we are open 11 to 2, and then reopen 4 to 9, and then Friday, Saturday, Sunday we... or Friday, Saturday we're open 11 to 9 and then Sunday we're open 1 to 9.”
Example of complex statement:
In natural spontaneous speech people talk faster and less clearly than they do when they speak to a machine, so speech recognition is harder and we see higher word error rates. The problem is aggravated during phone calls, which often have loud background noises and sound quality issues.
In longer conversations, the same sentence can have very different meanings depending on context. For example, when booking reservations “Ok for 4” can mean the time of the reservation or the number of people. Often the relevant context might be several sentences back, a problem that gets compounded by the increased word error rate in phone calls.
Deciding what to say is a function of both the task and the state of the conversation. In addition, there are some common practices in natural conversations — implicit protocols that include elaborations (“for next Friday” “for when?” “for Friday next week, the 18th.”), syncs (“can you hear me?”), interruptions (“the number is 212-” “sorry can you start over?”), and pauses (“can you hold? [pause] thank you!” different meaning for a pause of 1 second vs 2 minutes).
Enter Duplex Google Duplex’s conversations sound natural thanks to advances in understanding, interacting, timing, and speaking.
At the core of Duplex is a recurrent neural network (RNN) designed to cope with these challenges, built using TensorFlow Extended (TFX). To obtain its high precision, we trained Duplex’s RNN on a corpus of anonymized phone conversation data. The network uses the output of Google’s automatic speech recognition (ASR) technology, as well as features from the audio, the history of the conversation, the parameters of the conversation (e.g. the desired service for an appointment, or the current time of day) and more. We trained our understanding model separately for each task, but leveraged the shared corpus across tasks. Finally, we used hyperparameter optimization from TFX to further improve the model.
Incoming sound is processed through an ASR system. This produces text that is analyzed with context data and other inputs to produce a response text that is read aloud through the TTS system.
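Duplex’s exact architecture isn’t spelled out beyond this description, but as a toy sketch of the kind of input fusion involved, the snippet below combines an ASR token sequence with conversation-state features in a small Keras RNN that scores a fixed set of candidate responses. Every size, name and the response-classification framing are illustrative assumptions, not the actual Duplex model.

```python
import tensorflow as tf

# Toy sketch only (not the Duplex architecture): fuse the ASR transcript of the
# latest utterance with features describing the conversation state, and score a
# fixed set of candidate responses. All sizes are illustrative.
VOCAB_SIZE, MAX_TOKENS, NUM_CONTEXT_FEATURES, NUM_RESPONSES = 8000, 40, 32, 200

asr_tokens = tf.keras.Input(shape=(MAX_TOKENS,), dtype="int32", name="asr_tokens")
context = tf.keras.Input(shape=(NUM_CONTEXT_FEATURES,), name="conversation_state")

embedded = tf.keras.layers.Embedding(VOCAB_SIZE, 64, mask_zero=True)(asr_tokens)
utterance = tf.keras.layers.LSTM(128)(embedded)          # encode the latest utterance
fused = tf.keras.layers.Concatenate()([utterance, context])
hidden = tf.keras.layers.Dense(128, activation="relu")(fused)
response_scores = tf.keras.layers.Dense(NUM_RESPONSES, activation="softmax")(hidden)

model = tf.keras.Model([asr_tokens, context], response_scores)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```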
Duplex handling interruptions:
Duplex elaborating:
Duplex responding to a sync:
Sounding Natural We use a combination of a concatenative text-to-speech (TTS) engine and a synthesis TTS engine (using Tacotron and WaveNet) to control intonation depending on the circumstance.
The system also sounds more natural thanks to the incorporation of speech disfluencies (e.g. “hmm”s and “uh”s). These are added when combining widely differing sound units in the concatenative TTS or adding synthetic waits, which allows the system to signal in a natural way that it is still processing. (This is what people often do when they are gathering their thoughts.) In user studies, we found that conversations using these disfluencies sound more familiar and natural.
Also, it’s important for latency to match people’s expectations. For example, after people say something simple, e.g., “hello?”, they expect an instant response, and are more sensitive to latency. When we detect that low latency is required, we use faster, low-confidence models (e.g. speech recognition or endpointing). In extreme cases, we don’t even wait for our RNN, and instead use faster approximations (usually coupled with more hesitant responses, as a person would do if they didn’t fully understand their counterpart). This allows us to have less than 100ms of response latency in these situations. Interestingly, in some situations, we found it was actually helpful to introduce more latency to make the conversation feel more natural — for example, when replying to a really complex sentence.
System Operation The Google Duplex system is capable of carrying out sophisticated conversations and it completes the majority of its tasks fully autonomously, without human involvement. The system has a self-monitoring capability, which allows it to recognize the tasks it cannot complete autonomously (e.g., scheduling an unusually complex appointment). In these cases, it signals to a human operator, who can complete the task.
To train the system in a new domain, we use real-time supervised training. This is comparable to the training practices of many disciplines, where an instructor supervises a student as they are doing their job, providing guidance as needed, and making sure that the task is performed at the instructor’s level of quality. In the Duplex system, experienced operators act as the instructors. By monitoring the system as it makes phone calls in a new domain, they can affect the behavior of the system in real time as needed. This continues until the system performs at the desired quality level, at which point the supervision stops and the system can make calls autonomously.
Benefits for Businesses and Users Businesses that rely on appointment bookings of the kind Duplex supports, and that are not yet powered by online systems, can benefit from Duplex by allowing customers to book through the Google Assistant without having to change any day-to-day practices or train employees. Using Duplex could also reduce no-shows to appointments by reminding customers about their upcoming appointments in a way that allows easy cancellation or rescheduling.
Duplex calling a restaurant:
In another example, customers often call businesses to inquire about information that is not available online, such as hours of operation during a holiday. Duplex can call the business to inquire about open hours and make the information available online with Google, reducing the number of such calls businesses receive while at the same time making the information more accessible to everyone. Businesses can operate as they always have; there’s no learning curve or changes to make to benefit from this technology.
Duplex asking for holiday hours:
For users, Google Duplex is making supported tasks easier. Instead of making a phone call, the user simply interacts with the Google Assistant, and the call happens completely in the background without any user involvement.
A user asks the Google Assistant for an appointment, which the Assistant then schedules by having Duplex call the business.
Another benefit for users is that Duplex enables delegated communication with service providers in an asynchronous way, e.g., requesting reservations during off-hours, or with limited connectivity. It can also help address accessibility and language barriers, e.g., allowing hearing-impaired users, or users who don’t speak the local language, to carry out tasks over the phone.
This summer, we’ll start testing the Duplex technology within the Google Assistant, to help users make restaurant reservations, schedule hair salon appointments, and get holiday hours over the phone.
Yaniv Leviathan, Google Duplex lead, and Matan Kalman, engineering manager on the project, enjoying a meal booked through a call from Duplex.
Duplex calling to book the above meal:
Allowing people to interact with technology as naturally as they interact with each other has been a long-standing promise. Google Duplex takes a step in this direction, making interaction with technology via natural conversation a reality in specific scenarios. We hope that these technology advances will ultimately contribute to a meaningful improvement in people’s experience in day-to-day interactions with computers.