Tag Archives: compression

Lyra – enabling voice calls for the next billion users


Lyra Logo

The past year has shown just how vital online communication is to our lives. Never before has it been more important to clearly understand one another online, regardless of where you are and whatever network conditions are available. That’s why in February we introduced Lyra: a revolutionary new audio codec using machine learning to produce high-quality voice calls.

As part of our efforts to make the best codecs universally available, we are open sourcing Lyra, allowing other developers to power their communications apps and take Lyra in powerful new directions. This release provides the tools needed for developers to encode and decode audio with Lyra, optimized for the 64-bit ARM android platform, with development on Linux. We hope to expand this codebase and develop improvements and support for additional platforms in tandem with the community.

The Lyra Architecture

Lyra’s architecture is separated into two pieces, the encoder and decoder. When someone talks into their phone the encoder captures distinctive attributes from their speech. These speech attributes, also called features, are extracted in chunks of 40ms, then compressed and sent over the network. It is the decoder’s job to convert the features back into an audio waveform that can be played out over the listener’s phone speaker. The features are decoded back into a waveform via a generative model. Generative models are a particular type of machine learning model well suited to recreate a full audio waveform from a limited number of features. The Lyra architecture is very similar to traditional audio codecs, which have formed the backbone of internet communication for decades. Whereas these traditional codecs are based on digital signal processing (DSP) techniques, the key advantage for Lyra comes from the ability of the generative model to reconstruct a high-quality voice signal.

Lyra Architecture Chart

The Impact

While mobile connectivity has steadily increased over the past decade, the explosive growth of on-device compute power has outstripped access to reliable high speed wireless infrastructure. For regions where this contrast exists—in particular developing countries where the next billion internet users are coming online—the promise that technology will enable people to be more connected has remained elusive. Even in areas with highly reliable connections, the emergence of work-from-anywhere and telecommuting have further strained mobile data limits. While Lyra compresses raw audio down to 3kbps for quality that compares favourably to other codecs, such as Opus, it is not aiming to be a complete alternative, but can save meaningful bandwidth in these kinds of scenarios.

These trends provided motivation for Lyra and are the reason our open source library focuses on its potential for real time voice communication. There are also other applications we recognize Lyra may be uniquely well suited for, from archiving large amounts of speech, and saving battery by leveraging the computationally cheap Lyra encoder, to alleviating network congestion in emergency situations where many people are trying to make calls at once. We are excited to see the creativity the open source community is known for applied to Lyra in order to come up with even more unique and impactful applications.

The Open Source Release

The Lyra code is written in C++ for speed, efficiency, and interoperability, using the Bazel build framework with Abseil and the GoogleTest framework for thorough unit testing. The core API provides an interface for encoding and decoding at the file and packet levels. The complete signal processing toolchain is also provided, which includes various filters and transforms. Our example app integrates with the Android NDK to show how to integrate the native Lyra code into a Java-based android app. We also provide the weights and vector quantizers that are necessary to run Lyra.

We are releasing Lyra as a beta version today because we wanted to enable developers and get feedback as soon as possible. As a result, we expect the API and bitstream to change as it is developed. All of the code for running Lyra is open sourced under the Apache license, except for a math kernel, for which a shared library is provided until we can implement a fully open solution over more platforms. We look forward to seeing what people do with Lyra now that it is open sourced. Check out the code and demo on GitHub, let us know what you think, and how you plan to use it!

By Andrew Storus and Michael Chinen – Chrome


The following people helped make the open source release possible:
Yero Yeh, Alejandro Luebs, Jamieson Brettle, Tom Denton, Felicia Lim, Bastiaan Kleijn, Jan Skoglund, Yaowu Xu, Jim Bankoski (Chrome), Chenjie Gu, Zach Gleicher, Tom Walters, Norman Casagrande, Luis Cobo, Erich Elsen (DeepMind).

Lyra: A New Very Low-Bitrate Codec for Speech Compression

Connecting to others online via voice and video calls is something that is increasingly a part of everyday life. The real-time communication frameworks, like WebRTC, that make this possible depend on efficient compression techniques, codecs, to encode (or decode) signals for transmission or storage. A vital part of media applications for decades, codecs allow bandwidth-hungry applications to efficiently transmit data, and have led to an expectation of high-quality communication anywhere at any time.

As such, a continuing challenge in developing codecs, both for video and audio, is to provide increasing quality, using less data, and to minimize latency for real-time communication. Even though video might seem much more bandwidth hungry than audio, modern video codecs can reach lower bitrates than some high-quality speech codecs used today. Combining low-bitrate video and speech codecs can deliver a high-quality video call experience even in low-bandwidth networks. Yet historically, the lower the bitrate for an audio codec, the less intelligible and more robotic the voice signal becomes. Furthermore, while some people have access to a consistent high-quality, high-speed network, this level of connectivity isn’t universal, and even those in well connected areas at times experience poor quality, low bandwidth, and congested network connections.

To solve this problem, we have created Lyra, a high-quality, very low-bitrate speech codec that makes voice communication available even on the slowest networks. To do this, we’ve applied traditional codec techniques while leveraging advances in machine learning (ML) with models trained on thousands of hours of data to create a novel method for compressing and transmitting voice signals.

Lyra Overview
The basic architecture of the Lyra codec is quite simple. Features, or distinctive speech attributes, are extracted from speech every 40ms and are then compressed for transmission. The features themselves are log mel spectrograms, a list of numbers representing the speech energy in different frequency bands, which have traditionally been used for their perceptual relevance because they are modeled after human auditory response. On the other end, a generative model uses those features to recreate the speech signal. In this sense, Lyra is very similar to other traditional parametric codecs, such as MELP.

However traditional parametric codecs, which simply extract from speech critical parameters that can then be used to recreate the signal at the receiving end, achieve low bitrates, but often sound robotic and unnatural. These shortcomings have led to the development of a new generation of high-quality audio generative models that have revolutionized the field by being able to not only differentiate between signals, but also generate completely new ones. DeepMind’s WaveNet was the first of these generative models that paved the way for many to come. Additionally, WaveNetEQ, the generative model-based packet-loss-concealment system currently used in Duo, has demonstrated how this technology can be used in real-world scenarios.

A New Approach to Compression with Lyra
Using these models as a baseline, we’ve developed a new model capable of reconstructing speech using minimal amounts of data. Lyra harnesses the power of these new natural-sounding generative models to maintain the low bitrate of parametric codecs while achieving high quality, on par with state-of-the-art waveform codecs used in most streaming and communication platforms today. The drawback of waveform codecs is that they achieve this high quality by compressing and sending over the signal sample-by-sample, which requires a higher bitrate and, in most cases, isn’t necessary to achieve natural sounding speech.

One concern with generative models is their computational complexity. Lyra avoids this issue by using a cheaper recurrent generative model, a WaveRNN variation, that works at a lower rate, but generates in parallel multiple signals in different frequency ranges that it later combines into a single output signal at the desired sample rate. This trick enables Lyra to not only run on cloud servers, but also on-device on mid-range phones in real time (with a processing latency of 90ms, which is in line with other traditional speech codecs). This generative model is then trained on thousands of hours of speech data and optimized, similarly to WaveNet, to accurately recreate the input audio.

Comparison with Existing Codecs
Since the inception of Lyra, our mission has been to provide the best quality audio using a fraction of the bitrate data of alternatives. Currently, the royalty-free open-source codec Opus, is the most widely used codec for WebRTC-based VOIP applications and, with audio at 32kbps, typically obtains transparent speech quality, i.e., indistinguishable from the original. However, while Opus can be used in more bandwidth constrained environments down to 6kbps, it starts to demonstrate degraded audio quality. Other codecs are capable of operating at comparable bitrates to Lyra (Speex, MELP, AMR), but each suffer from increased artifacts and result in a robotic sounding voice.

Lyra is currently designed to operate at 3kbps and listening tests show that Lyra outperforms any other codec at that bitrate and is compared favorably to Opus at 8kbps, thus achieving more than a 60% reduction in bandwidth. Lyra can be used wherever the bandwidth conditions are insufficient for higher-bitrates and existing low-bitrate codecs do not provide adequate quality.

Clean Speech
[email protected]
[email protected]
[email protected]
Noisy Environment
[email protected]
[email protected]
[email protected]

Reference[email protected][email protected]

Ensuring Fairness
As with any ML based system, the model must be trained to make sure that it works for everyone. We’ve trained Lyra with thousands of hours of audio with speakers in over 70 languages using open-source audio libraries and then verifying the audio quality with expert and crowdsourced listeners. One of the design goals of Lyra is to ensure universally accessible high-quality audio experiences. Lyra trains on a wide dataset, including speakers in a myriad of languages, to make sure the codec is robust to any situation it might encounter.

Societal Impact and Where We Go From Here
The implications of technologies like Lyra are far reaching, both in the short and long term. With Lyra, billions of users in emerging markets can have access to an efficient low-bitrate codec that allows them to have higher quality audio than ever before. Additionally, Lyra can be used in cloud environments enabling users with various network and device capabilities to chat seamlessly with each other. Pairing Lyra with new video compression technologies, like AV1, will allow video chats to take place, even for users connecting to the internet via a 56kbps dial-in modem.

Duo already uses ML to reduce audio interruptions, and is currently rolling out Lyra to improve audio call quality and reliability on very low bandwidth connections. We will continue to optimize Lyra’s performance and quality to ensure maximum availability of the technology, with investigations into acceleration via GPUs and TPUs. We are also beginning to research how these technologies can lead to a low-bitrate general-purpose audio codec (i.e., music and other non-speech use cases).

Thanks to everyone who made Lyra possible including Jan Skoglund, Felicia Lim, Michael Chinen, Bastiaan Kleijn, Tom Denton, Andrew Storus, Yero Yeh (Chrome Media), Henrik Lundin, Niklas Blum, Karl Wiberg (Google Duo), Chenjie Gu, Zach Gleicher, Norman Casagrande, Erich Elsen (DeepMind).

Source: Google AI Blog

Basis Universal Textures – Khronos Ratification and Support

In 2019, Google partnered with Binomial to open source the Basis Universal texture codec with the goal to make high-quality textures more efficient for network transmission and graphics processing unit (GPU) memory usage. The Basis Universal texture format is 6-8 times smaller than JPEG on the GPU, yet has similar storage size as JPEG—making it a great alternative to current GPU compression methods that are inefficient and don’t operate cross platform. The format is intended for a variety of use cases: games, virtual and augmented reality, maps, photos, small videos, and more.

the Basis Universal texture codec
Over the past year, several exciting developments have been made to make Basis Universal more useful. A new high-quality mode was introduced, allowing the codec to use the highest quality formats modern GPUs support, finally bringing the web up to modern GPU texture standards—with cross platform support. Additionally, the Basis encoder now has an option to build a WebAssembly version, allowing for innovative web applications to take advantage of outputting to the super-compressed format. Lastly, the Khronos Group has announced and ratified the Basis Universal texture extension to glTF format, allowing for compressed assets that can be shipped and displayed everywhere in a KTX 2.0 container. This will have profound impacts on how models are distributed via the web and advance applications like eCommerce, making it easy to take advantage of 3D content on any platform.

In addition to these new features, developers worldwide have been making it easier to take advantage of Basis Universal. <model-viewer> has just added support for glTF files with universal textures, making it as easy as two lines of JavaScript to have beautiful, interactive 3D models on your page and in the coming months, the <model-viewer> editor will add support for encoding to universal textures. Additionally, 3D engines like Three.js, Babylon.js, Godot, Archilogic, and Playcanvas have added support for Basis Universal, with more engine support coming. Basis Universal is already in applications many use every day.

We look forward to seeing Basis Universal adoption soar as it has never been easier to distribute 3D assets. Check out the code and demo on GitHub, let us know what you think, and how you plan to use it!

By Stephanie Hurlburt, Binomial and Jamieson Brettle, Chrome Media

Improving Sparse Training with RigL

Modern deep neural network architectures are often highly redundant [1, 2, 3], making it possible to remove a significant fraction of connections without harming performance. The sparse neural networks that result have been shown to be more parameter and compute efficient compared to dense networks, and, in many cases, can significantly decrease wall clock inference times.

By far the most popular method for training sparse neural networks is pruning, (dense-to-sparse training) which usually requires first training a dense model, and then “sparsifying” it by cutting out the connections with negligible weights. However, this process has two limitations.

  1. The size of the largest trainable sparse model is limited by that of the largest trainable dense model. Even if sparse models are more parameter efficient, one cannot use pruning to train models that are larger and more accurate than the largest possible dense models.
  2. Pruning is inefficient, meaning that large amounts of computation must be performed for parameters that are zero valued or that will be zero during inference. Additionally, it remains unknown if the performance of the current best pruning algorithms are an upper bound on the quality of sparse models.
Training sparse networks from scratch, on the other hand, is efficient, however often achieves inferior performance compared to pruning.

In “Rigging the Lottery: Making All Tickets Winners”, presented at ICML 2020, we introduce RigL, an algorithm for training sparse neural networks that uses a fixed parameter count and computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods. The algorithm identifies which neurons should be active during training, which helps the optimization process to utilize the most relevant connections and results in better sparse solutions. An example of this is shown below, where, during the training of a multilayer perceptron (MLP) network on MNIST, our sparse network trained with RigL learns to focus on the center of the images, discarding the uninformative pixels from the edges. A Tensorflow implementation of our method along with three other baselines (SET, SNFS, SNIP) can be found at github.com/google-research/rigl.

Left: Average MNIST image. Right: Evolution of the connectivity of the input throughout the training of a 98% sparse, 2-layer MLP on MNIST. Training starts from a random sparse mask, where each input pixel has roughly six outgoing connections. Connections that originate from the edges do not exhibit meaningful gradients and are therefore replaced by more informative connections that originate from the center pixels.

RigL Overview
The RigL method starts with a network initialized with a random sparse topology. At regularly spaced intervals we remove a fraction of the connections with the smallest weight magnitudes. Such a strategy has been shown to have very little effect on the loss. RigL then activates new connections using instantaneous gradient information, i.e., without using past gradient information. After updating the connectivity, training continues with the updated network until the next scheduled update. Next, the system activates connections with large gradients, since these connections are expected to decrease the loss most quickly.

RigL begins with a random sparse initialization of the network. It then trains the network and trims out those connections with weak activations. Based on the gradients calculated for the new configuration, it grows new connections and trains again, repeating the cycle.

Evaluating Performance
By changing the connectivity of the neurons dynamically during training, RigL helps optimize to find better solutions. To demonstrate this, we restart training from a bad solution that exhibits poor accuracy and show that RigL's mask updates help the optimization achieve better loss compared to static training, in which connectivity of the sparse network remains the same.

Training loss of RigL and Static methods starting from the same static sparse solution, shown together with their final test accuracies.

The figure below summarizes the performance of various methods on training an 80% sparse ResNet-50 architecture. We compare RigL with two recent sparse training methods, SET and SNFS and three baseline training methods: Static, Small-Dense and Pruning. Two of these methods (SNFS and Pruning) require dense resources as they need to either train a large network or store the gradients of it. Overall, we observe that the performance of all methods improves with additional training time; thus, for each method we run extended training with up to 5x the training steps of the original 100 epochs.

As noted in a number of studies [4, 5, 6, 7], training a network with fixed sparsity from scratch (Static) leads to inferior performance compared to solutions found by pruning. Training a small, dense network (Small-Dense) with the same number of parameters gets better results than Static, but fails to match the performance of dynamic sparse models. Similarly, SET improves the performance over Small-Dense, but saturates at around 75% accuracy, revealing the limits of growing new connections randomly. Methods that use gradient information to grow new connections (RigL and SNFS) obtain higher accuracy in general, but RigL achieves the highest accuracy, while also consistently requiring fewer FLOPs (and memory footprint) than the other methods.

Performance of sparse training methods on training an 80% sparse ResNet-50 architecture with uniform sparsity distribution. Points at each curve correspond to the individual training runs with increasing training length. The number of FLOPs required to train a standard dense ResNet-50 along with its performance is indicated with a dashed red line. RigL matches the standard ResNet-50 performance, even though it is 5x smaller in size.

Observing the trend between extended training and performance, we compare the results using longer training runs. Within the interval considered (i.e., 1x-100x) RigL's performance constantly improves with additional training. RigL achieves state of art performance of 68.07% Top-1 accuracy at training with a 99% sparse ResNet-50 architecture. Similarly extended training of a 90% sparse MobileNet-v1 architecture with RigL achieves 70.55% Top-1 accuracy. Obtaining the same results with fewer training iterations is an exciting future research direction.

Effect of training time on RigL accuracy at training 99% sparse ResNet-50 (left) and 90% sparse MobileNets-v1 (right) architectures.

Other experiments include image classification on CIFAR-10 datasets and character-based language modelling using RNNs with the WikiText-103 dataset and can be found in the full paper.

Future Work
RigL is useful in three different scenarios:

  1. Improving the accuracy of sparse models intended for deployment.
  2. Improving the accuracy of large sparse models that can only be trained for a limited number of iterations.
  3. Combining with sparse primitives to enable training of extremely large sparse models which otherwise would not be possible.
The third scenario is unexplored due to the lack of hardware and software support for sparsity. Nonetheless, work continues [8, 9, 10] to improve the performance of sparse networks on current hardware and new types of hardware accelerators are expected to have better support for parameter sparsity [11, 12]. We hope RigL provides the tools to take advantage of, and motivation for, such advances.

AcknowledgementsWe would like to thank Eleni Triantafillou, Hugo Larochelle, Bart van Merrienboer, Fabian Pedregosa, Joan Puigcerver, Danny Tarlow, Nicolas Le Roux, Karen Simonyan for giving feedback on the preprint of the paper; Namhoon Lee for helping us verify and debug our SNIP implementation; Chris Jones for helping us discover and solve the distributed training bug; and Tom Small for creating the visualization of the algorithm.

Source: Google AI Blog

Celebrating 10 years of WebM and WebRTC

Originally posted on the Chromium Blog

Ten years ago, Google planted the seeds for two foundational web media technologies, hoping they would provide the roots for a more vibrant internet. Two acquisitions, On2 Technologies and Global IP Solutions, led to a pair of open source projects: the WebM Project, a family of cutting edge video compression technologies (codecs) offered by Google royalty-free, and the WebRTC Project building APIs for real-time voice and video communication on the web.

These initiatives were major technical endeavors, essential infrastructure for enabling the promise of HTML5 with support for video conferencing and streaming. But this was also a philosophical evolution for media as Product Manager Mike Jazayeri noted in his blog post hailing the launch of the WebM Project:
“A key factor in the web’s success is that its core technologies such as HTML, HTTP, TCP/IP, etc. are open and freely implementable.”
As emerging first-class participants in the web experience, media and communication components also had to be free and open.

A decade later, these principles have ensured compression and communication technologies capable of keeping pace with a web ecosystem characterized by exponential growth of media consumption, devices, and demand. Starting from VP8 in 2010, the WebM Project has delivered up to 50% video bitrate savings with VP9 in 2013 and an additional 30% with AV1 in 2018—with adoption by YouTube, Facebook, Netflix, Twitch, and more. Equally importantly, the WebM team co-founded the Alliance for Open Media which has brought the IP of over 40 major tech companies in support of open and free codecs. With Chrome, Edge, Firefox and Safari supporting WebRTC, more than 85% of all installed browsers globally have become a client for real-time communications on the Internet. WebRTC has become a stable standard and it is now the default solution for video calling on the Web. These technologies have succeeded together, as today over 90% of encoded WebRTC video in Chrome uses VP8 or VP9.

The need for these technologies has been highlighted by COVID-19, as people across the globe have found new ways to work, educate, and connect with loved ones via video chat. The compression of open codecs has been essential to keeping services running on limited bandwidth, with over a billion hours of VP9 and AV1 content viewed every day. WebRTC has allowed for an ecosystem of interoperable communications apps to flourish: since the beginning of March 2020, we have seen in Chrome a 13X increase in received video streams via WebRTC.

These successes would not have been possible without all the supporters that make an open source community. Thank you to all the code contributors, testers, bug filers, and corporate partners who helped make this ecosystem a reality. A decade in, Google remains as committed as ever to open media on the web. We look forward to continuing that work with all of you in the next decade and beyond.

By Matt Frost, Product Director Chrome Media and Niklas Blum, Senior Product Manager WebRTC

Optimizing Multiple Loss Functions with Loss-Conditional Training

In many machine learning applications the performance of a model cannot be summarized by a single number, but instead relies on several qualities, some of which may even be mutually exclusive. For example, a learned image compression model should minimize the compressed image size while maximizing its quality. It is often not possible to simultaneously optimize all the values of interest, either because they are fundamentally in conflict, like the image quality and the compression ratio in the example above, or simply due to the limited model capacity. Hence, in practice one has to decide how to balance the values of interest.
The trade-off between the image quality and the file size in image compression. Ideally both the image distortion and the file size would be minimized, but these two objectives are fundamentally in conflict.
The standard approach to training a model that must balance different properties is to minimize a loss function that is the weighted sum of the terms measuring those properties. For instance, in the case of image compression, the loss function would include two terms, corresponding to the image reconstruction quality and the compression rate. Depending on the coefficients on these terms, training with this loss function results in a model producing image reconstructions that are either more compact or of higher quality.

If one needs to cover different trade-offs between model qualities (e.g. image quality vs compression rate), the standard practice is to train several separate models with different coefficients in the loss function of each. This requires keeping around multiple models both during training and inference, which is very inefficient. However, all of these separate models solve very related problems, suggesting that some information could be shared between them.

In two concurrent papers accepted at ICLR 2020, we propose a simple and broadly applicable approach that avoids the inefficiency of training multiple models for different loss trade-offs and instead uses a single model that covers all of them. In “You Only Train Once: Loss-Conditional Training of Deep Networks”, we give a general formulation of the method and apply it to several tasks, including variational autoencoders and image compression, while in “Adjustable Real-time Style Transfer”, we dive deeper into the application of the method to style transfer.

Loss-Conditional Training
The idea behind our approach is to train a single model that covers all choices of coefficients of the loss terms, instead of training a model for each set of coefficients. We achieve this by (i) training the model on a distribution of losses instead of a single loss function, and (ii) conditioning the model outputs on the vector of coefficients of the loss terms. This way, at inference time the conditioning vector can be varied, allowing us to traverse the space of models corresponding to loss functions with different coefficients.

This training procedure is illustrated in the diagram below for the style transfer task. For each training example, first the loss coefficients are randomly sampled. Then they are used both to condition the main network via the conditioning network and to compute the loss. The whole system is trained jointly end-to-end, i.e., the model parameters are trained concurrently with random sampling of loss functions.
Overview of the method, using stylization as an example. The main stylization network is conditioned on randomly sampled coefficients of the loss function and is trained on a distribution of loss functions, thus learning to model the entire family of loss functions.
The conceptual simplicity of this approach makes it applicable to many problem domains, with only minimal changes to existing code bases. Here we focus on two such applications, image compression and style transfer.

Application: Variable-Rate Image Compression
As a first example application of our approach, we show the results for learned image compression. When compressing an image, a user should be able to choose the desired trade-off between the image quality and the compression rate. Classic image compression algorithms are designed to allow for this choice. Yet, many leading learned compression methods require training a separate model for each such trade-off, which is computationally expensive both at training and at inference time. For problems such as this, where one needs a set of models optimized for different losses, our method offers a simple way to avoid inefficiency and cover all trade-offs with a single model.

We apply the loss-conditional training technique to the learned image compression model of Balle et al. The loss function here consists of two terms, a reconstruction term responsible for the image quality and a compactness term responsible for the compression rate. As illustrated below, our technique allows training a single model covering a wide range of quality-compression tradeoffs.
Compression at different quality levels with a single model. All animations are generated with a single model by varying the conditioning value.
Application: Adjustable Style Transfer
The second application we demonstrate is artistic style transfer, in which one synthesizes an image by merging the content from one image and the style from another. Recent methods allow training deep networks that stylize images in real time and in multiple styles. However, for each given style these methods do not allow the user to have control over the details of the synthesized output, for instance, how much to stylize the image and on which style features to place greater emphasis. If the stylized output is not appealing to the user, they have to train multiple models with different hyper-parameters until they get a favorite stylization.

Our proposed method instead allows training a single model covering a wide range of stylization variants. In this task, we condition the model on a loss function, which has coefficients corresponding to five loss terms, including the content loss and four terms for the stylization loss. Intuitively, the content loss regulates how much the stylized image should be similar to the original content, while the four stylization losses define which style features get carried over to the final stylized image. Below we show the outputs of our single model when varying all these coefficients:
Adjustable style transfer. All stylizations are generated with a single network by varying the conditioning values.
Clearly, the model captures a lot of variation within each style, such as the degree of stylization, the type of elements being added to the image, their exact configuration and locations, and more. More examples can be found on our webpage along with an interactive demo.

We have proposed loss-conditional training, a simple and general method that allows training a single deep network for tasks that would formerly require a large set of separately trained networks. While we have shown its application to image compression and style transfer, many more applications are possible — whenever the loss function has coefficients to be tuned, our method allows training a single model covering a wide range of these coefficients.

This blog post covers the work by multiple researchers on the Google Brain team: Mohammad Babaeizadeh, Johannes Balle, Josip Djolonga, Alexey Dosovitskiy, and Golnaz Ghiasi. This blog post would not be possible without crucial contributions from all of them. Images from the MS-COCO dataset and from unsplash.com are used for illustrations.

Source: Google AI Blog

Google and Binomial partner to open source high quality basis universal

Today, Google and Binomial are excited to announce the high quality update to the original Basis Universal release.

Basis Universal allows you to have state of the art web performance with your images, keeping images compressed even on the GPU. Older systems like JPEG and PNG may look small in storage size, but once they hit the GPU they are processed as uncompressed data! The original Basis Universal codec created images that were 6-8 times smaller than JPEG on the GPU while maintaining a similar storage size.

Today we release a high quality Basis Universal codec that utilizes the highest quality formats modern GPUs support, finally bringing the web up to modern GPU texture standards—with cross platform support. The textures are larger in storage size and GPU compressed size, but are still 3-4 times smaller than sending a JPEG or PNG file to be processed on the GPU, and can transcode to a lower quality format for older GPUs.
Original Image by Erol Ahmed from Unsplash.com
Visual comparison of Basis Universal High Quality

Best of all, we are actively working on standardizing Basis Universal with the Khronos Group.

Since our original release in Summer 2019 we’ve seen widespread adoption of Basis Universal in engines like three.js, Babylon.js, Godot, and more, changing what is possible for people to create on the web. Now that a high quality option is available, we expect to see even more adoption and groundbreaking applications created with it.

Please feel free to join our community on Github and check out the full demo there as well. You can also follow standardization efforts via Khronos Group events and forums.

By Stephanie Hurlburt, Binomial and Jamieson Brettle, Chrome Media

Announcing the Third Workshop and Challenge on Learned Image Compression

With the large amount of media content being downloaded and streamed across the internet, minimizing bandwidth while maintaining quality remains a constant challenge. In 2015, researchers demonstrated that neural network-based image compression could yield significant improvements to image resolution while retaining good quality and high compression speed. Continued advances in compression and bandwidth optimization techniques were stimulated in part by two successful workshops that we hosted at CVPR in 2018 and 2019.

Today, we are excited to announce the Third Workshop and Challenge On Learned Image Compression (CLIC) at CVPR 2020. This workshop challenges researchers to use machine learning, neural networks and other computer vision approaches to increase the quality and lower the bandwidth needed for multimedia transmission. This year’s workshop will also include two challenges: a low-rate image compression challenge and a P-Frame video compression challenge.

Similar to previous years, the goal of the low-rate image compression challenge is to compress an image dataset to 0.15 bits per pixel while maintaining the highest possible quality. Finalists will be selected by measuring their performance against the PSNR and MS-SSIM evaluation metrics. The final ranking will then be determined by a human evaluated rating task.

This year we are also introducing a P-Frame compression track, the first video compression task in this series. In this challenge, participants must first generate a transformation between two adjacent video frames. In the decompression part of the task, participants then use the first frame and their compressed representation to reconstruct the second frame. This challenge will be ranked based solely on the MS-SSIM performance score.

If you are doing research in the field of learned image compression or video compression, we encourage you to participate in CLIC, whether in the two competitions or the paper-only track for publications to be presented at the workshop at CVPR 2020. The validation server is currently available for submissions. The deadline for the final submission of the test set is March 23rd, 2020. For more details on the competition and an up-to-date schedule, please refer to compression.cc. Additional announcements and answers to questions can be found on our Google Groups page.

This workshop is being jointly hosted by researchers at Google, Twitter and ETH Zurich. We’d like to thank: George Toderici (Google), Nick Johnston (Google), Johannes Ballé (Google), Eirikur Agustsson (Google), Lucas Theis (Google), Wenzhe Shi (Twitter), Radu Timofte (ETH Zurich) and Fabian Mentzer (ETH Zurich) for their contributions.

Source: Google AI Blog

Google and Binomial Partner to Open-Source Basis Universal Texture Format

Today, Google and Binomial are excited to announce that we have partnered to open source the Basis Universal texture codec to improve the performance of transmitting images on the web and within desktop and mobile applications, while maintaining GPU efficiency. This release fills an important gap in the graphics compression ecosystem and complements earlier work in Draco geometry compression.

The Basis Universal texture format is 6-8 times smaller than JPEG on the GPU, yet is a similar storage size as JPEG – making it a great alternative to current GPU compression methods that are inefficient and don’t operate cross platform – and provides a more performant alternative to JPEG/PNG. It creates compressed textures that work well in a variety of use cases - games, virtual & augmented reality, maps, photos, small-videos, and more!

Without a universal texture format, developers are left with 2 options:

  • Use GPU formats and take the storage size hit.
  • Use other formats that have reduced storage size but couldn't compete with the GPU performance.

Maintaining so many different GPU formats is a burden on the whole ecosystem, from GPU manufacturers to software developers to the end user who can’t get a great cross platform experience. We’re streamlining this with one solution that has built-in flexibility (like optional higher quality modes) but is much easier on everyone to improve and maintain.

How does it all work? Compress your image using the encoder, choosing the quality settings that make sense for your project (you can also submit multiple images for small videos or optimization purposes, just know they’ll share the same color palette). Insert the transcoder code before rendering, which will turn the intermediary format into the GPU format your computer can read. The image stays compressed throughout this process, even on your GPU!  Instead of needing to decode and read the whole image, the GPU will read only the parts it needs. Enjoy the performance benefits!
Basis Universal can efficiently target the most common GPU formats
Google and Binomial will be working together to continue to support, maintain and add features, so check back frequently for the latest. This initial release of Basis Universal transcodes into the following GPU formats: PVRTC1 opaque, ETC1, ETC2 basic alpha, BC1-5, and BC7 opaque. Over the coming months more functionality will be added including BC7 transparent, ASTC opaque and alpha, PVRTC1 transparent, and higher quality BC7/ASTC.
Basis Universal reduces transmission size for texture while maintaining similar image quality.
See full benchmarking results
Basis Universal improves GPU memory usage over .jpeg and .png
With this partnership, we hope to see adoption of the transcoder in all major browsers to make performant cross-platform compressed textures accessible to everyone via the WebGL API, and the forthcoming WebGPU API. In addition to opening up the possibility of seamless integration into pipelines, everyone now has access to the state of the art compressor, which will also be open sourced.

We look forward to seeing what people do with Basis Universal now that it's open sourced. Check out the code and demo on GitHub, let us know what you think, and how you plan to use it! Currently, Basis Universal transcoders are available in C++ and WebAssembly.

By Stephanie Hurlburt, Binomial and Jamieson Brettle, Chrome Media

Announcing the Second Workshop and Challenge on Learned Image Compression

Last year, we announced the Workshop and Challenge on Learned Image Compression (CLIC), an event that aimed to advance the field of image compression with and without neural networks. Held during the 2018 Computer Vision and Pattern Recognition conference (CVPR 2018), CLIC was quite a success, with 23 accepted workshop papers, 95 authors and 41 entries into the competition. This spawned many new algorithms for image compression, domain specific applications to medical image compression and augmentations to existing methods based, with the winner Tucodec (abbreviated TUCod4c in the image below) achieving 13% better mean opinion score (MOS) than Better Portable Graphics (BPG) compression.
An example image from the 2018 test set, comparing the original image to BPG, JPEG and the results from nine competing teams. All the methods are better than JPEG in color reproduction and many of them are comparable to BPG in their ability to create legible text on the sign.
This year, we are again happy co-sponsor the second Workshop and Challenge on Learned Image Compression at CVPR 2019 in Long Beach, California.The half day workshop will feature talks from invited guests Anne Aaron (Netflix), Aaron Van Den Oord (DeepMind) and Jyrki Alakuijala (Google), along with presentations from five top performing teams in the 2019 competition, which is currently open for submissions.

This year's competition features two tracks for participants to compete in. The first track remains the same as last year, in what we're calling the "low-rate compression" track. The goal for low-rate compression is to compress an image dataset to 0.15 bits per pixel and maintaining the highest quality metrics as measured by PSNR, MS-SSIM and a human evaluated rating task.

The second track incorporates feedback from last year's workshop, in which participants expressed interest in the inverse challenge of determining the amount an image could be compressed and still look good. In this "transparent compression" challenge, we set a relatively high quality threshold for the test dataset (in both PSNR and MS-SSIM) with the goal of compressing the dataset to the smallest file sizes.

If you're doing research in the field of learned image compression, we encourage you to participate in CLIC during CVPR 2019. For more details on the competition and dates, please refer to compression.cc.

This workshop is being jointly hosted by researchers at Google, Twitter and ETH Zürich. We'd like to thank: George Toderici (Google), Michele Covell (Google), Johannes Ballé (Google), Nick Johnston (Google), Eirikur Agustsson (Google), Wenzhe Shi (Twitter), Lucas Theis (Twitter), Radu Timofte (ETH Zürich), Fabian Mentzer (ETH Zürich) for their contributions.

Source: Google AI Blog