Posted by Badih Ghazi and Joshua R. Wang, Research Scientists, Google Research
Much of classical machine learning (ML) focuses on utilizing available data to make more accurate predictions. More recently, researchers have considered other important objectives, such as how to design algorithms to be small, efficient, and robust. With these goals in mind, a natural research objective is the design of a system on top of neural networks that efficiently stores information encoded within—in other words, a mechanism to compute a succinct summary (a “sketch”) of how a complex deep network processes its inputs. Sketching is a rich field of study that dates back to the foundational work of Alon, Matias, and Szegedy, which can enable neural networks to efficiently summarize information about their inputs.
For example: Imagine stepping into a room and briefly viewing the objects within. Modern machine learning is excellent at answering immediate questions, known at training time, about this scene: “Is there a cat? How big is said cat?” Now, suppose we view this room every day over the course of a year. People can reminisce about the times they saw the room: “How often did the room contain a cat? Was it usually morning or night when we saw the room?”. However, can one design systems that are also capable of efficiently answering such memory-based questions even if they are unknown at training time?
In “Recursive Sketches for Modular Deep Learning”, recently presented at ICML 2019, we explore how to succinctly summarize how a machine learning model understands its input. We do this by augmenting an existing (already trained) machine learning model with “sketches” of its computation, using them to efficiently answer memory-based questions—for example, image-to-image-similarity and summary statistics—despite the fact that they take up much less memory than storing the entire original computation.
Basic Sketching Algorithms In general, sketching algorithms take a vector x and produce an output sketch vector that behaves like x but whose storage cost is much smaller. The fact that the storage cost is much smaller allows one to succinctly store information about the network, which is critical for efficiently answering memory-based questions. In the simplest case, a linear sketch x is given by the matrix-vector product Ax where A is a wide matrix, i.e., the number of columns is equal to the original dimension of x and the number of rows is equal to the new reduced dimension. Such methods have led to a variety of efficient algorithms for basic tasks on massive datasets, such as estimating fundamental statistics (e.g., histogram, quantiles and interquartile range), finding popular items (known as frequent elements), as well as estimating the number of distinct elements (known as support size) and the related tasks of norms and entropy estimation.
A simple method to sketch the vector x is to multiply it by a wide matrix A to produce a lower-dimensional vector y.
This basic approach works well in the relatively simple case of linear regression, where it is possible to identify important data dimensions simply by the magnitude of weights (under the common assumption that they have uniform variance). However, many modern machine learning models are actually deep neural networks and are based on high-dimensional embeddings (such as Word2Vec, Image Embeddings, Glove, DeepWalk and BERT), which makes the task of summarizing the operation of the model on the input much more difficult. However, a large subset of these more complex networks are modular, allowing us to generate accurate sketches of their behavior, in spite of their complexity.
Neural Network Modularity A modular deep network consists of several independent neural networks (modules) that only communicate via one’s output serving as another’s input. This concept has inspired several practical architectures, including Neural Modular Networks, Capsule Neural Networks and PathNet. It is also possible to split other canonical architectures to view them as modular networks and apply our approach. For example, convolutional neural networks (CNNs) are traditionally understood to behave in a modular fashion; they detect basic concepts and attributes in their lower layers and build up to detecting more complex objects in their higher layers. In this view, the convolution kernels correspond to modules. A cartoon depiction of a modular network is given below.
This is a cartoon depiction of a modular network for image processing. Data flows from the bottom of the figure to the top through the modules represented with blue boxes. Note that modules in the lower layers correspond to basic objects, such as edges in an image, while modules in upper layers correspond to more complex objects, like humans or cats. Also notice that in this imaginary modular network, the output of the face module is generic enough to be used by both the human and cat modules.
Sketch Requirements To optimize our approach for these modular networks, we identified several desired properties that a network sketch should satisfy:
Sketch-to-Sketch Similarity: The sketches of two unrelated network operations (either in terms of the present modules or in terms of the attribute vectors) should be very different; on the other hand, the sketches of two similar network operations should be very close.
Attribute Recovery: The attribute vector, e.g., the activations of any node of the graph can be approximately recovered from the top-level sketch.
Summary Statistics: If there are multiple similar objects, we can recover summary statistics about them. For example, if an image has multiple cats, we can count how many there are. Note that we want to do this without knowing the questions ahead of time.
Graceful Erasure: Erasing a suffix of the top-level sketch maintains the above properties (but would smoothly increase the error).
Network Recovery: Given sufficiently many (input, sketch) pairs, the wiring of the edges of the network as well as the sketch function can be approximately recovered.
This is a 2D cartoon depiction of the sketch-to-sketch similarity property. Each vector represents a sketch and related sketches are more likely to cluster together.
The Sketching Mechanism The sketching mechanism we propose can be applied to a pre-trained modular network. It produces a single top-level sketch summarizing the operation of this network, simultaneously satisfying all of the desired properties above. To understand how it does this, it helps to first consider a one-layer network. In this case, we ensure that all the information pertaining to a specific node is “packed” into two separate subspaces, one corresponding to the node itself and one corresponding to its associated module. Using suitable projections, the first subspace lets us recover the attributes of the node whereas the second subspace facilitates quick estimates of summary statistics. Both subspaces help enforce the aforementioned sketch-to-sketch similarity property. We demonstrate that these properties hold if all the involved subspaces are chosen independently at random.
Of course, extra care has to be taken when extending this idea to networks with more than one layer—which leads to our recursive sketching mechanism. Due to their recursive nature, these sketches can be “unrolled” to identify sub-components, capturing even complicated network structures. Finally, we utilize a dictionary learning algorithm tailored to our setup to prove that the random subspaces making up the sketching mechanism together with the network architecture can be recovered from a sufficiently large number of (input, sketch) pairs.
Future Directions The question of succinctly summarizing the operation of a network seems to be closely related to that of model interpretability. It would be interesting to investigate whether ideas from the sketching literature can be applied to this domain. Our sketches could also be organized in a repository to implicitly form a “knowledge graph”, allowing patterns to be identified and quickly retrieved. Moreover, our sketching mechanism allows for seamlessly adding new modules to the sketch repository—it would be interesting to explore whether this feature can have applications to architecture search and evolving network topologies. Finally, our sketches can be viewed as a way of organizing previously encountered information in memory, e.g., images that share the same modules or attributes would share subcomponents of their sketches. This, on a very high level, is similar to the way humans use prior knowledge to recognize objects and generalize to unencountered situations.
Acknowledgements This work was the joint effort of Badih Ghazi, Rina Panigrahy and Joshua R. Wang.
Posted by Paul Dille, Senior Software Developer, Carnegie Mellon University CREATE Lab, and Chris Herwig, Geo Data Engineer, Google Earth Outreach
Six years ago, we first introduced Google Earth Timelapse, a global, zoomable time-lapse video that lets anyone explore our changing planet’s surface—from the global scale to the local scale. Earth Timelapse consists of 83 million multi-resolution overlapping video tiles, which are made interactively explorable through the open-source Time Machine client software developed at Carnegie Mellon University’s CREATE Lab. At its core, Google Earth Timelapse is an example of how organizing information can make it more accessible and useful, turning petabytes of satellite imagery into an interactive experience that shows the dynamic changes occurring across space and time. In April, we introduced several updates to Timelapse, including two additional years of imagery to the time-series visualization, which now spans from 1984 to 2018, with visual upgrades that make exploring more accessible and intuitive. We are especially excited that this update includes support for mobile and tablet devices, which are quickly overtaking desktop computers as the dominant source of app traffic.
Building the Global Visualization Making a planetary-sized time-lapse video required a significant amount of pixel crunching in Earth Engine, Google's cloud platform for petabyte-scale geospatial analysis. The new release followed a process similar to what we did in 2013, but at a significantly greater scale—turning 15 million satellite images acquired over the last three and a half decades from the USGS/NASA Landsat and European Sentinel programs into 35 cloud-free 4-terapixel images of the planet—one for each year from 1984 to 2018.
At its native resolution, the Timelapse visualization is a 4 terapixel video (that's four trillion pixels), which would take about 12 days to download on a 95 Mb/s internet connection. Most computers would have difficulty playing a video of this size, let alone with an interactive, zoomable interface. The problem is even more severe for a mobile device.
A solution was pioneered by Google Maps in 2004 with the map pyramiding technique. Before that time, navigating a map required the use of directional arrows to pan and zoom, with each step requiring the page to reload. The map pyramiding technique assembles the full map image displayed on-screen from tens of small 256x256 pixel non-overlapping image tiles in an array, with new tiles fetched as needed at an appropriate resolution as the user pans and zooms across the map.
A traditional Mercator map pyramid contains non-overlapping image tiles.
This works very well for maps made of static images, but less so for pyramids of video tiles, such as those used by Timelapse, since it requires a web browser to keep up to 16 videos in sync while interacting with the visualization. The solution is embodied in CREATE Lab’s open source Time Machine software: create much larger video tiles that can cover the entire screen and only show one whole-screen tile at a time. The tiles create a pyramid, where sibling tiles overlap with their neighbors to provide a seamless transition between tiles while panning and zooming. Though the overlapping tiles require the use of about 16x more videos, this pyramid structure enables the use of Timelapse on mobile devices by minimizing the amount of data required for visualization.
In our newest release, the global video pyramid consists of 83 million videos across 13 zoom levels, which required about 2 million CPU hours distributed across thousands of machines in Google Cloud to generate.
Earth Timelapse uses a pyramid of overlapping video tiles.
Time Travel, Wherever You Are Prior to April's update, ~30% of visitors to the Timelapse visualization were on mobile devices and didn't actually experience the visualization; instead they saw a YouTube playlist of locations in Timelapse. Until recently, the hardware and CPUs for phones and tablets could not decode videos fast enough without significant delays when someone attempted to zoom in or pan across a video, making mobile exploration unpleasant, if not impossible. In addition, in order for the visualization to be smooth as you pan and zoom, each video that is loaded must sync to the previously playing video and begin playing automatically. But, until only recently, mobile browser vendors had disabled video autoplay at the browser level for bandwidth reasons.
Redesigning Timelapse for Exploration Across Devices Timelapse is a tool for exploration, so we designed for immersiveness, devoting as much real estate as possible to the map. On the other hand, it's not just a map, but a map of videos. So we kept controls visible, like pausing and restarting the timeline or choosing highlights, by leveraging Material Design with simple, clean lines and clear focal areas.
Navigate with Google Maps using the new "Maps Mode" toggle.
To explore, you need to know where you are or where somewhere else is, so the new interface includes a new "Maps Mode" toggle that lets the user navigate with Google Maps. We also built in scalability to the timeline element of the UI, so that new features added in the future, such as lengthening the time-lapse or adding options for different time increments, won’t break the design. The timeline also allows the user to go backwards in time—an interesting way to compare the present with the past.
For desktop browsers supporting WebGL, we also added a new WebGL viewer to the open source project, which loads and synchronizes multiple videos to fill the screen at optimal resolution. The aesthetic improvement of this is nontrivial, with >4x better resolution.
What's next We're excited about the abundance of freely available, openly licensed satellite imagery and remote sensing data available, enabling new visualizations across time, space, and the visual and non-visual spectrum. We've found it's often the data combined with supplemental layers, such as the World Database on Protected Areas (WDPA) boundaries, that can spark new insights. For example, seeing the visual connection between declining home ownership and shifts in the city of Pittsburgh's racial makeup tells a story about inequality that numbers on a page simply cannot. Visual evidence can transcend language and cultural barriers and, we hope, generate productive conversations about our global challenges.
Acknowledgements Randy Sargent, Senior Systems Scientist, Carnegie Mellon University CREATE Lab and the Google Earth Engine team
Posted by Bartlomiej Wronski, Software Engineer and Peyman Milanfar, Lead Scientist, Computational Imaging
Digital zoom using algorithms (rather than lenses) has long been the “ugly duckling” of mobile device cameras. As compared to the optical zoom capabilities of DSLR cameras, the quality of digitally zoomed images has not been competitive, and conventional wisdom is that the complex optics and mechanisms of larger cameras can't be replaced with much more compact mobile device cameras and clever algorithms.
With the new Super Res Zoom feature on the Pixel 3, we are challenging that notion.
The Super Res Zoom technology in Pixel 3 is different and better than any previous digital zoom technique based on upscaling a crop of a single image, because we merge many frames directly onto a higher resolution picture. This results in greatly improved detail that is roughly competitive with the 2x optical zoom lenses on many other smartphones. Super Res Zoom means that if you pinch-zoom before pressing the shutter, you’ll get a lot more details in your picture than if you crop afterwards.
Crops of 2x Zoom: Pixel 2, 2017 vs. Super Res Zoom on the Pixel 3, 2018.
The Challenges of Digital Zoom Digital zoom is tough because a good algorithm is expected to start with a lower resolution image and "reconstruct" missing details reliably — with typical digital zoom a small crop of a single image is scaled up to produce a much larger image. Traditionally, this is done by linear interpolation methods, which attempt to recreate information that is not available in the original image, but introduce a blurry- or “plasticy” look that lacks texture and details. In contrast, most modern single-image upscalers use machine learning (including our own earlier work, RAISR). These magnify some specific image features such as straight edges and can even synthesize certain textures, but they cannot recover natural high-resolution details. While we still use RAISR to enhance the visual quality of images, most of the improved resolution provided by Super Res Zoom (at least for modest zoom factors like 2-3x) comes from our multi-frame approach.
Color Filter Arrays and Demosaicing Reconstructing fine details is especially difficult because digital photographs are already incomplete — they’ve been reconstructed from partial color information through a process called demosaicing. In typical consumer cameras, the camera sensor elements are meant to measure only the intensity of the light, not directly its color. To capture real colors present in the scene, cameras use a color filter array placed in front of the sensor so that each pixel measures only a single color (red, green, or blue). These are arranged in a Bayer pattern as shown in the diagram below.
A Bayer mosaic color filter. Every 2x2 group of pixels captures light filtered by a specific color — two green pixels (because our eyes are more sensitive to green), one red, and one blue. This pattern is repeated across the whole image.
A camera processing pipeline then has to reconstruct the real colors and all the details at all pixels, given this partial information.* Demosaicing starts by making a best guess at the missing color information, typically by interpolating from the colors in nearby pixels, meaning that two-thirds of an RGB digital picture is actually a reconstruction!
Demosaicing reconstructs missing color information by using neighboring neighboring pixels.
In its simplest form, this could be achieved by averaging from neighboring values. Most real demosaicing algorithms are more complicated than this, but they still lead to imperfect results and artifacts - as we are limited to only partial information. While this situation exists even for large-format DSLR cameras, their bigger sensors and larger lenses allow for more detail to be captured than is typical in a mobile camera.
The situation gets worse if you pinch-zoom on a mobile device; then algorithms are forced to make up even more information, again by interpolation from the nearby pixels. However, not all is lost. This is where burst photography and the fusion of multiple images can be used to allow for super-resolution, even when limited by mobile device optics. From Burst Photography to Multi-frame Super-resolution While a single frame doesn't provide enough information to fill in the missing colors , we can get some of this missing information from multiple images taken successively. The process of capturing and combining multiple sequential photographs is known as burst photography. Google’s HDR+ algorithm, successfully used in Nexus and Pixel phones, already uses information from multiple frames to make photos from mobile phones reach the level of quality expected from a much larger sensor; could a similar approach be used to increase image resolution?
It has been known for more than a decade, including in astronomy where the basic concept is known as “drizzle”, that capturing and combining multiple images taken from slightly different positions can yield resolution equivalent to optical zoom, at least at low magnifications like 2x or 3x and in good lighting conditions. In this process, called muti-frame super-resolution, the general idea is to align and merge low-resolution bursts directly onto a grid of the desired (higher) resolution. Here's an example of how an idealized multi-frame super-resolution algorithm might work:
As compared to the standard demosaicing pipeline that needs to interpolate the missing colors (top), ideally, one could fill some holes from multiple images, each shifted by one pixel horizontally or vertically.
In the example above, we capture 4 frames, three of them shifted by exactly one pixel: in the horizontal, vertical, and both horizontal and vertical directions. All the holes would get filled, and there would be no need for any demosaicing at all! Indeed, some DSLR cameras support this operation, but only if the camera is on a tripod, and the sensor/optics are actively moved to different positions. This is sometimes called "microstepping".
Over the years, the practical usage of this “super-res” approach to higher resolution imaging remained confined largely to the laboratory, or otherwise controlled settings where the sensor and the subject were aligned and the movement between them was either deliberately controlled or tightly constrained. For instance, in astronomical imaging, a stationary telescope sees a predictably moving sky. But in widely used imaging devices like the modern-day smartphone, the practical usage of super-res for zoom in applications like mobile device cameras has remained mostly out of reach.
This is in part due to the fact that in order for this to work properly, certain conditions need to be satisfied. First, and most important, is that the lens needs to resolve detail better than the sensor used (in contrast, you can imagine a case where the lens is so poorly-designed that adding a better sensor provides no benefit). This property is often observed as an unwanted artifact of digital cameras called aliasing.
Image Aliasing Aliasing occurs when a camera sensor is unable to faithfully represent all patterns and details present in a scene. A good example of aliasing are Moiré patterns, sometimes seen on TV as a result of an unfortunate choice of wardrobe. Furthermore, the aliasing effect on a physical feature (such as an edge of a table) changes when things move in a scene. You can observe this in the following burst sequence, where slight motions of the camera during the burst sequence create time-varying alias effects:
Left: High-resolution, single image of a table edge against a high frequency patterned background, Right: Different frames from a burst. Aliasing and Moiré effects are visible between different frames — pixels seem to jump around and produce different colored patterns.
However, this behavior is a blessing in disguise: if one analyzes the patterns produced, it gives us the variety of color and brightness values, as discussed in the previous section, to achieve super-resolution. That said, many challenges remain, as practical super-resolution needs to work with a handheld mobile phone and on any burst sequence. Practical Super-resolution Using Hand Motion As noted earlier, some DSLR cameras offer special tripod super-resolution modes that work in a way similar to what we described so far. These approaches rely on the physical movement of the sensors and optics inside the camera, but require a complete stabilization of the camera otherwise, which is impractical in mobile devices, since they are nearly always handheld. This would seem to create a catch-22 for super-resolution imaging on mobile platforms.
However, we turn this difficulty on its head, by using the hand-motion to our advantage. When we capture a burst of photos with a handheld camera or phone, there is always some movement present between the frames. Optical Image Stabilization (OIS) systems compensate for large camera motions - typically 5-20 pixels between successive frames spaced 1/30 second apart - but are unable to completely eliminate faster, lower magnitude, natural hand tremor, which occurs for everyone (even those with “steady hands”). When taking photos using mobile phones with a high resolution sensor, this hand tremor has a magnitude of just a few pixels.
Effect of hand tremor as seen in a cropped burst, after global alignment.
To take advantage of hand tremor, we first need to align the pictures in a burst together. We choose a single image in the burst as the “base” or reference frame, and align every other frame relative to it. After alignment, the images are combined together roughly as in the diagram shown earlier in this post. Of course, handshake is unlikely to move the image by exactly single pixels, so we need to interpolate between adjacent pixels in each newly captured frame before injecting the colors into the pixel grid of our base frame.
When hand motion is not present because the device is completely stabilized (e.g. placed on a tripod), we can still achieve our goal of simulating natural hand motion by intentionally “jiggling” the camera, by forcing the OIS module to move slightly between the shots. This movement is extremely small and chosen such that it doesn’t interfere with normal photos - but you can observe it yourself on Pixel 3 by holding the phone perfectly still, such as by pressing it against a window, and maximally pinch-zooming the viewfinder. Look for a tiny but continuous elliptical motion in distant objects, like that shown below.
Overcoming the Challenges of Super-resolution The description of the ideal process we gave above sounds simple, but super-resolution is not that easy — there are many reasons why it hasn’t widely been used in consumer products like mobile phones, and requires the development of significant algorithmic innovations. Challenges can include:
A single image from a burst is noisy, even in good lighting. A practical super-resolution algorithm needs to be aware of this noise and work correctly despite it. We don’t want to get just a higher resolution noisy image - our goal is to both increase the resolution but also produce a much less noisy result.
Left: Single frame frame from a burst taken in good light conditions can still contain a substantial amount of noise due to underexposure. Right: Result of merging multiple frames after burst processing.
Motion between images in a burst is not limited to just the movement of the camera. There can be complex motions in the scene such as wind-blown leaves, ripples moving across the surface of water, cars, people moving or changing their facial expressions, or the flicker of a flame — even some movements that cannot be assigned a single, unique motion estimate because they are transparent or multi-layered, such as smoke or glass. Completely reliable and localized alignment is generally not possible, and therefore a good super-resolution algorithm needs to work even if motion estimation is imperfect.
Because much of motion is random, even if there is good alignment, the data may be dense in some areas of the image and sparse in others. The crux of super-resolution is a complex interpolation problem, so the irregular spread of data makes it challenging to produce a higher-resolution image in all parts of the grid.
All the above challenges would seem to make real-world super-resolution either infeasible in practice, or at best limited to only static scenes and a camera placed on a tripod. With Super Res Zoom on Pixel 3, we’ve developed a stable and accurate burst resolution enhancement method that uses natural hand motion, and is robust and efficient enough to deploy on a mobile phone.
Here’s how we’ve addressed some of these challenges:
To effectively merge frames in a burst, and to produce a red, green, and blue value for every pixel without the need for demosaicing, we developed a method of integrating information across the frames that takes into account the edges of the image, and adapts accordingly. Specifically, we analyze the input frames and adjust how we combine them together, trading off increase in detail and resolution vs. noise suppression and smoothing. We accomplish this by merging pixels along the direction of apparent edges, rather than across them. The net effect is that our multi-frame method provides the best practical balance between noise reduction and enhancement of details.
Left: Merged image with sub-optimal tradeoff of noise reduction and enhanced resolution. Right: The same merged image with a better tradeoff.
To make the algorithm handle scenes with complex local motion (people, cars, water or tree leaves moving) reliably, we developed a robustness model that detects and mitigates alignment errors. We select one frame as a “reference image”, and merge information from other frames into it only if we’re sure that we have found the correct corresponding feature. In this way, we can avoid artifacts like “ghosting” or motion blur, and wrongly merged parts of the image.
A fast moving bus in a burst of images. Left: Merge without robustness model. Right: Merge with robustness model.
Pushing the State of the Art in Mobile Photography The Portrait mode last year, and the HDR+ pipeline before it, showed how good mobile photography can be. This year, we set out to do the same for zoom. That’s another step in advancing the state of the art in computational photography, while shrinking the quality gap between mobile photography and DSLRs. Here is an album containing full FOV images, followed by Super Res Zoom images. Note that the Super Res Zoom images in this album are not cropped — they are captured directly on-device using pinch-zoom.
Left: Crop of 7x zoomed image on Pixel 2. Right: Same crop from Super Res Zoom on Pixel 3.
The idea of super-resolution predates the advent of smart-phones by at least a decade. For nearly as long, it has also lived in the public imagination through films and television. It’s been the subject of thousands of papers in academic journals and conferences. Now, it is real — in the palm of your hands, in Pixel 3.
An illustrative animation of Super Res Zoom. When the user takes a zoomed photo, the Pixel 3 takes advantage of the user’s natural hand motion and captures a burst of images at subtly different positions. These are then merged together to add detail to the final image.
Acknowledgements Super Res Zoom is the result of a collaboration across several teams at Google. The project would not have been possible without the joint efforts of teams managed by Peyman Milanfar, Marc Levoy, and Bill Freeman. The authors would like to thank Marc Levoy and Isaac Reynolds in particular for their assistance in the writing of this blog.
The authors wish to especially acknowledge the following key contributors to the Super Res Zoom project: Ignacio Garcia-Dorado, Haomiao Jiang, Manfred Ernst, Michael Krainin, Daniel Vlasic, Jiawen Chen, Pascal Getreuer, and Chia-Kai Liang. The project also benefited greatly from contributions and feedback by Ce Liu, Damien Kelly, and Dillon Sharlet.
How to get the most out of Super Res Zoom? Here are some tips on getting the best of Super Res Zoom on a Pixel 3 phone:
Pinch and zoom, or use the + button to increase zoom by discrete steps.
Double-tap the preview to quickly toggle between zoomed in and zoomed out.
Super Res works well at all zoom factors, though for performance reasons, it activates only above 1.2x. That’s about half way between no zoom and the first “click” in the zoom UI.
There are fundamental limits to the optical resolution of a wide-angle camera. So to get the most out of (any) zoom, keep the magnification factor modest.
Avoid fast moving objects. Super Res zoom will capture them correctly, but you will not likely get increased resolution.
* It’s worth noting that the situation is similar in some ways to how we see — in human (and other mammalian) eyes, different eye cone cells are sensitive to some specific colors, with the brain filling in the details to reconstruct the full image.↩
Posted by Hossein Talebi, Software Engineer and Peyman Milanfar Research Scientist, Machine Perception
Quantification of image quality and aesthetics has been a long-standing problem in image processing and computer vision. While technical quality assessment deals with measuring pixel-level degradations such as noise, blur, compression artifacts, etc., aesthetic assessment captures semantic level characteristics associated with emotions and beauty in images. Recently, deep convolutional neural networks (CNNs) trained with human-labelled data have been used to address the subjective nature of image quality for specific classes of images, such as landscapes. However, these approaches can be limited in their scope, as they typically categorize images to two classes of low and high quality. Our proposed method predicts the distribution of ratings. This leads to a more accurate quality prediction with higher correlation to the ground truth ratings, and is applicable to general images.
In “NIMA: Neural Image Assessment” we introduce a deep CNN that is trained to predict which images a typical user would rate as looking good (technically) or attractive (aesthetically). NIMA relies on the success of state-of-the-art deep object recognition networks, building on their ability to understand general categories of objects despite many variations. Our proposed network can be used to not only score images reliably and with high correlation to human perception, but also it is useful for a variety of labor intensive and subjective tasks such as intelligent photo editing, optimizing visual quality for increased user engagement, or minimizing perceived visual errors in an imaging pipeline.
Background In general, image quality assessment can be categorized into full-reference and no-reference approaches. If a reference “ideal” image is available, image quality metrics such as PSNR, SSIM, etc. have been developed. When a reference image is not available, “blind” (or no-reference) approaches rely on statistical models to predict image quality. The main goal of both approaches is to predict a quality score that correlates well with human perception. In a deep CNN approach to image quality assessment, weights are initialized by training on object classification related datasets (e.g. ImageNet), and then fine-tuned on annotated data for perceptual quality assessment tasks.
NIMA Typical aesthetic prediction methods categorize images as low/high quality. This is despite the fact that each image in the training data is associated to a histogram of human ratings, rather than a single binary score. A histogram of ratings is an indicator of overall quality of an image, as well as agreements among raters. In our approach, instead of classifying images a low/high score or regressing to the mean score, the NIMA model produces a distribution of ratings for any given image — on a scale of 1 to 10, NIMA assigns likelihoods to each of the possible scores. This is more directly in line with how training data is typically captured, and it turns out to be a better predictor of human preferences when measured against other approaches (more details are available in our paper).
Various functions of the NIMA vector score (such as the mean) can then be used to rank photos aesthetically. Some test photos from the large-scale database for Aesthetic Visual Analysis (AVA) dataset, as ranked by NIMA, are shown below. Each AVA photo is scored by an average of 200 people in response to photography contests. After training, the aesthetic ranking of these photos by NIMA closely matches the mean scores given by human raters. We find that NIMA performs equally well on other datasets, with predicted quality scores close to human ratings.
Ranking some examples labelled with the “landscape” tag from AVA dataset using NIMA. Predicted NIMA (and ground truth) scores are shown below each image.
NIMA scores can also be used to compare the quality of images of the same subject which may have been distorted in various ways. Images shown in the following example are part of the TID2013 test set, which contain various types and levels of distortions.
Ranking some examples from TID2013 dataset using NIMA. Predicted NIMA scores are shown below each image.
Perceptual Image Enhancement As we’ve shown in another recent paper, quality and aesthetic scores can also be used to perceptually tune image enhancement operators. In other words, maximizing NIMA score as part of a loss function can increase the likelihood of enhancing perceptual quality of an image. The following example shows that NIMA can be used as a training loss to tune a tone enhancement algorithm. We observed that the baseline aesthetic ratings can be improved by contrast adjustments directed by the NIMA score. Consequently, our model is able to guide a deep CNN filter to find aesthetically near-optimal settings of its parameters, such as brightness, highlights and shadows.
NIMA can be used as a training loss to enhance images. In this example, local tone and contrast of images is enhanced by training a deep CNN with NIMA as its loss. Test images are obtained from the MIT-Adobe FiveK dataset.
Looking Ahead Our work on NIMA suggests that quality assessment models based on machine learning may be capable of a wide range of useful functions. For instance, we may enable users to easily find the best pictures among many; or to even enable improved picture-taking with real-time feedback to the user. On the post-processing side, these models may be used to guide enhancement operators to produce perceptually superior results. In a direct sense, the NIMA network (and others like it) can act as reasonable, though imperfect, proxies for human taste in photos and possibly videos. We’re excited to share these results, though we know that the quest to do better in understanding what quality and aesthetics mean is an ongoing challenge — one that will involve continuing retraining and testing of our models.
Posted by Behnam Bastani, Software Engineer Manager and Eric Turner, Software Engineer, Daydream
Virtual Reality (VR) and Mixed Reality (MR) offer a novel way to immerse people into new and compelling experiences, from gaming to professional training. However, current VR/MR technologies present a fundamental challenge: to present images at the extremely high resolution required for immersion places enormous demands on the rendering engine and transmission process. Headsets often have insufficient display resolution, which can limit the field of view, worsening the experience. But, to drive a higher resolution headset, the traditional rendering pipeline requires significant processing power that even high-end mobile processors cannot achieve. As research continues to deliver promising new techniques to increase display resolution, the challenges of driving those displays will continue to grow.
In order to further improve the visual experience in VR and MR, we introduce a pipeline that takes advantage of the characteristics of human visual perception to enable a amazing visual experience at low compute and power cost. The pipeline proposed in this article considers the full system dependency including the rendering engine, memory bandwidth and capability of display module itself. We determined that the current limitation is not just in the content creation, but it also may be in transmitting data, handling latency and enabling interaction with real objects (mixed reality applications). The pipeline consists of 1. Foveated Rendering with a focus on reducing of compute per pixel. 2. Foveated Image Processing with a focus on the reduction of visual artifacts and 3. Foveated Transmission with a focus on bits per pixel transmitted.
Foveated Rendering In the human visual system, the fovea centralis allows us to see at high-fidelity in the center of our vision, allowing our brain to pay less attention to things in our peripheral vision. Foveated rendering takes advantage of this characteristic to improve the performance of the rendering engine by reducing the spatial or bit-depth resolution of objects in our peripheral vision. To make this work, the location of the High Acuity (HA) region needs to be updated with eye-tracking to align with eye saccades, which preserves the perception of a constant high-resolution across the field of view. In contrast, systems with no eye-tracking may need to render a much larger HA region.
The left image is rendered at full resolution. The right image uses two layers of foveation — one rendered at high resolution (inside the yellow region) and one at lower resolution (outside).
A traditional foveation technique may divide a frame buffer into multiple spatial resolution regions. Aliasing introduced by rendering to lower spatial resolution may cause perceptible temporal artifacts when there is motion in the content due to head motion or animation. Below we show an example of temporal artifacts introduced by head rotation.
A smooth full rendering (image on the left). The image on the right shows temporal artifacts introduced by motion in foveated region.
In the following sections, we present two different methods we use aimed at reducing these artifacts: Phase-Aligned Foveated Rendering and Conformal Foveated Rendering. Each of these methods provide different benefits for visual quality during rendering and are useful under different conditions.
Phase-Aligned Rendering Aliasing occurs in the Low-Acuity (LA) region during foveated rendering due to the subsampling of rendered content. In traditional foveated rendering discussed above, these aliasing artifacts flicker from frame to frame, since the display pixel grid moves across the virtual scene as the user moves their head. The motion of these pixels relative to the scene cause any existing aliasing artifacts to flicker, which is highly perceptible to the user, even in the periphery.
In Phase-Aligned rendering, we force the LA region frustums to be aligned rotationally to the world (e.g. always facing north, east, south, etc.), not the current frame's head-rotation. The aliasing artifacts are mostly invariant to head pose and therefore much less detectable. After upsampling, these regions are then reprojected onto the final display screen to compensate for the user's head rotation, which reduces temporal flicker. As with traditional foveation, we render the high-acuity region in a separate pass, and overlay it onto the merged image at the location of the fovea. The figure below compares traditional foveated rendering with phase-aligned rendering, both at the same level of foveation.
Temporal artifacts in non-world aligned foveated rendered content (left) and the phase-aligned method (right).
This method gives a major benefit to reducing the severity of visual artifacts during foveated rendering. Although phase-aligned rendering is more expensive to compute than traditional foveation under the same level of acuity reduction, we can still yield a net savings by pushing foveation to more aggressive levels that would otherwise have yielded too many artifacts. Conformal Rendering Another approach for foveated rendering is to render content in a space that matches the smoothly varying reduction in resolution of our visual acuity, based on a nonlinear mapping of the screen distance from the visual fixation point.
This method gives two main benefits. First, by more closely matching the visual fidelity fall-off of the human eye, we can reduce the total number of pixels computed compared to other foveation techniques. Second, by using a smooth fall-off in fidelity, we prevent the user from seeing a clear dividing line between High-Acuity and Low-Acuity, which is often one of the first artifacts that is noticed. These benefits allow for aggressive foveation to be used while preserving the same quality levels, yielding more savings.
We perform this method by warping the vertices of the virtual scene into non-linear space. This scene is then rasterized at a reduced resolution, then unwarped into linear space as a post-processing effect combined with lens distortion correction.
Comparison of traditional foveation (left) to conformal rendering (right), where content is rendered to a space matched to visual perception acuity and HMD lens characteristics. Both methods use the same number of total pixels.
A major benefit of this method over the phase-aligned method above is that conformal rendering only requires a single pass of rasterization. For scenes with lots of vertices, this difference can provide major savings. Additionally, although phase-aligned rendering reduces flicker, it still produces a distinct boundary between the high- and low-acuity regions, whereas conformal rendering does not show this artifact. However, a downside of conformal rendering compared to phase-alignment is that aliasing artifacts still flicker in the periphery, which may be less desirable for applications that require high visual fidelity. Foveated Image Processing HMDs often require image processing steps to be performed after rendering, such as local tone mapping, lens distortion correction, or lighting blending. With foveated image-processing, different operations are applied for different foveation regions. As an example, lens distortion correction, including chromatic aberration correction, may not require the same spatial accuracy for each part of the display. By running lens distortion correction on foveated content before upscaling, significant savings are gained in computation. This technique does not introduce perceptible artifacts.
Correction for head-mounted-display lens chromatic aberration in foveated space. Top image shows the conventional pipeline. The bottom image (in Green) shows the operation in the foveated space.
The left image shows reconstructed foveated content after lens distortion. The right image shows image difference when lens distortion correction is performed in a foveated manner. The right image shows that minimal error is introduced close to edges of frame buffer. These errors are imperceptible in an HMD.
Foveated Transmission A non-trivial source of power consumption for standalone HMDs is data transmission from the system-on-a-chip (SoC) to the display module. Foveated transmission aims to save power and bandwidth by transmitting the minimum amount of data necessary to the display as shown in figure below.
Rather than streaming upscaled foveated content (left image), foveated transmission enables streaming content pre-reconstruction (right image) and reducing the number of bits transmitted.
This change requires moving the simple upscaling and blending operations to the display side and transmitting only the foveated rendered content. Complexity arises if the foveal region, the red box in above figure, moves with eyetracking. Such motion may cause temporal artifacts (figure below) since Display Stream Compression (DSC) used between SoC and the display is not designed for foveated content.
Comparison of full integration of foveation and compression techniques (left) versus typical flickering artifacts that may be introduced by applying DSC to foveated content (right).
Toward a New Pipeline We have focused on a few components of a “foveation pipeline” for MR and VR applications. By considering the impact of foveation in every part of a display system — rendering, processing and transmission — we can enable the next generation of lightweight, low-power, and high resolution MR/VR HMDs. This topic has been an active area of research for many years and it seems reasonable to expect the appearance of VR and MR headsets with foveated pipelines in the coming years.
Acknowledgements We would like to recognize the work done by the following collaborators:
Haomiao Jiang and Carlin Vieri on display compression and foveated transmission
Brian Funt and Sylvain Vignaud on the development of new foveated rendering algorithms
Posted by Hui Fang, Software Engineer, Machine Perception
Machine learning (ML) excels in many areas with well defined goals. Tasks where there exists a right or wrong answer help with the training process and allow the algorithm to achieve its desired goal, whether it be correctly identifying objects in images or providing a suitable translation from one language to another. However, there are areas where objective evaluations are not available. For example, whether a photograph is beautiful is measured by its aesthetic value, which is a highly subjective concept.
A professional(?) photograph of Jasper National Park, Canada.
To explore how ML can learn subjective concepts, we introduce an experimental deep-learning system for artistic content creation. It mimics the workflow of a professional photographer, roaming landscape panoramas from Google Street View and searching for the best composition, then carrying out various postprocessing operations to create an aesthetically pleasing image. Our virtual photographer “travelled” ~40,000 panoramas in areas like Alps, Banff and Jasper National Parks, big sur in California, and Yellowstone National Park, returned with creations that are quite impressive, some even approaching professional quality -- as judged by professional photographers.
Training the Model While aesthetics can be modelled using datasets like AVA, using it naively to enhance photos may miss some aspect in aesthetics, such as making a photo over-saturated. Using supervised learning to learn multiple aspects in aesthetics properly, however, may require a labelled dataset that is intractable to collect.
Our approach relies only on a collection of professional quality photos, without before/after image pairs, or any additional labels. It breaks down aesthetics into multiple aspects automatically, each of which is learned individually with negative examples generated by a coupled image operation. By keeping these image operations semi-”orthogonal”, we can enhance a photo on its composition, saturation/HDR level and dramatic lighting with fast and separable optimizations:
A panorama (a) is cropped into (b), with saturation and HDR strength enhanced in (c), and with dramatic mask applied in (d). Each step is guided by one learned aspect of aesthetics.
A traditional image filter was used to generate negative training examples for saturation, HDR detail and composition. We also introduce a special operation named dramatic mask, which was created jointly while learning the concept of dramatic lighting. The negative examples were generated by applying a combination of image filters that modify brightness randomly on professional photos, degrading their appearance. For the training we use a generative adversarial network (GAN), where a generative model creates a mask to fix lighting for negative examples, while a discriminative model tries to distinguish enhanced results from the real professional ones. Unlike shape-fixed filters such as vignette, dramatic mask adds content-aware brightness adjustment to a photo. The competitive nature of GAN training leads to good variations of such suggestions. You can read more about the training details in our paper.
Results Some creations of our system from Google Street View are shown below. As you can see, the application of the trained aesthetic filters creates some dramatic results (including the image we started this post with!):
Jasper National Park, Canada.
Park Parco delle Orobie Bergamasche, Italy.
Jasper National Park, Canada.
Professional Evaluation To judge how successful our algorithm was, we designed a “Turing-test”-like experiment: we mix our creations with other photos at different quality, and show them to several professional photographers. They were instructed to assign a quality score for each of them, with meaning defined as following:
1: Point-and-shoot without consideration for composition, lighting etc.
2: Good photos from general population without a background in photography. Nothing artistic stands out.
3: Semi-pro. Great photos showing clear artistic aspects. The photographer is on the right track of becoming a professional.
In the following chart, each curve shows scores from professional photographers for images within a certain predicted score range. For our creations with a high predicted score, about 40% ratings they received are at “semi-pro” to “pro” levels.
Scores received from professional photographers for photos with different predicted scores.
Future Work The Street View panoramas served as a testing bed for our project. Someday this technique might even help you to take better photos in the real world. We compiled a showcase of photos created to our satisfaction. If you see a photo you like, you can click on it to bring out a nearby Street View panorama. Would you make the same decision if you were there holding the camera at that moment?
Acknowledgements This work was done by Hui Fang and Meng Zhang from Machine Perception at Google Research. We would like to thank Vahid Kazemi for his earlier work in predicting AVA scores using Inception network, and Sagarika Chalasani, Nick Beato, Bryan Klingner and Rupert Breheny for their help in processing Google Street View panoramas. We would like to thank Peyman Milanfar, Tomas Izo, Christian Szegedy, Jon Barron and Sergey Ioffe for their helpful reviews and comments. Huge thanks to our anonymous professional photographers!
Posted by Florian Kainz, Software Engineer, Google Daydream
On a full moon night last year I carried a professional DSLR camera, a heavy lens and a tripod up to a hilltop in the Marin Headlands just north of San Francisco to take a picture of the Golden Gate Bridge and the lights of the city behind it.
A view of the Golden Gate Bridge from the Marin Headlands, taken with a DSLR camera (Canon 1DX, Zeiss Otus 28mm f/1.4 ZE). Click here for the full resolution image.
I thought the photo of the moonlit landscape came out well so I showed it to my (then) teammates in Gcam, a Google Research team that focuses on computational photography - developing algorithms that assist in taking pictures, usually with smartphones and similar small cameras. Seeing my nighttime photo, one of the Gcam team members challenged me to re-take it, but with a phone camera instead of a DSLR. Even though cameras on cellphones have come a long way, I wasn’t sure whether it would be possible to come close to the DSLR shot.
Probably the most successful Gcam project to date is the image processing pipeline that enables the HDR+ mode in the camera app on Nexus and Pixel phones. HDR+ allows you to take photos at low-light levels by rapidly shooting a burst of up to ten short exposures and averaging them them into a single image, reducing blur due to camera shake while collecting enough total light to yield surprisingly good pictures. Of course there are limits to what HDR+ can do. Once it gets dark enough the camera just cannot gather enough light and challenging shots like nighttime landscapes are still beyond reach.
The Challenges To learn what was possible with a cellphone camera in extremely low-light conditions, I looked to the experimental SeeInTheDark app, written by Marc Levoy and presented at the ICCV 2015 Extreme Imaging Workshop, which can produce pictures with even less light than HDR+. It does this by accumulating more exposures, and merging them under the assumption that the scene is static and any differences between successive exposures must be due to camera motion or sensor noise. The app reduces noise further by dropping image resolution to about 1 MPixel. With SeeInTheDark it is just possible to take pictures, albeit fairly grainy ones, by the light of the full moon.
However, in order to keep motion blur due to camera shake and moving objects in the scene at acceptable levels, both HDR+ and SeeInTheDark must keep the exposure times for individual frames below roughly one tenth of a second. Since the user can’t hold the camera perfectly still for extended periods, it doesn’t make sense to attempt to merge a large number of frames into a single picture. Therefore, HDR+ merges at most ten frames, while SeeInTheDark progressively discounts older frames as new ones are captured. This limits how much light the camera can gather and thus affects the quality of the final pictures at very low light levels.
Of course, if we want to take high-quality pictures of low-light scenes (such as a landscape illuminated only by the moon), increasing the exposure time to more than one second and mounting the phone on a tripod or placing it on some other solid support makes the task a lot easier. Google’s Nexus 6P and Pixel phones support exposure times of 4 and 2 seconds respectively. As long as the scene is static, we should be able to record and merge dozens of frames to produce a single final image, even if shooting those frames takes several minutes.
Even with the use of a tripod, a sharp picture requires the camera’s lens to be focused on the subject, and this can be tricky in scenes with very low light levels. The two autofocus mechanisms employed by cellphone cameras — contrast detection and phase detection — fail when it’s dark enough that the camera's image sensor returns mostly noise. Fortunately, the interesting parts of outdoor scenes tend to be far enough away that simply setting the focus distance to infinity produces sharp images.
Experiments & Results Taking all this into account, I wrote a simple Android camera app with manual control over exposure time, ISO and focus distance. When the shutter button is pressed the app waits a few seconds and then records up to 64 frames with the selected settings. The app saves the raw frames captured from the sensor as DNG files, which can later be downloaded onto a PC for processing.
To test my app, I visited the Point Reyes lighthouse on the California coast some thirty miles northwest of San Francisco on a full moon night. I pointed a Nexus 6P phone at the building and shot a burst of 32 four-second frames at ISO 1600. After covering the camera lens with opaque adhesive tape I shot an additional 32 black frames. Back at the office I loaded the raw files into Photoshop. The individual frames were very grainy, as one would expect given the tiny sensor in a cellphone camera, but computing the mean of all 32 frames cleaned up most of the grain, and subtracting the mean of the 32 black frames removed faint grid-like patterns caused by local variations in the sensor's black level. The resulting image, shown below, looks surprisingly good.
Point Reyes lighthouse at night, photographed with Google Nexus 6P (full resolution image here).
The lantern in the lighthouse is overexposed, but the rest of the scene is sharp, not too grainy, and has pleasing, natural looking colors. For comparison, a hand-held HDR+ shot of the same scene looks like this:
Point Reyes Lighthouse at night, hand-held HDR+ shot (full resolution image here). The inset rectangle has been brightened in Photoshop to roughly match the previous picture.
Satisfied with these results, I wanted to see if I could capture a nighttime landscape as well as the stars in the clear sky above it, all in one picture. When I took the photo of the lighthouse a thin layer of clouds conspired with the bright moonlight to make the stars nearly invisible, but on a clear night a two or four second exposure can easily capture the brighter stars. The stars are not stationary, though; they appear to rotate around the celestial poles, completing a full turn every 24 hours. The motion is slow enough to be invisible in exposures of only a few seconds, but over the minutes it takes to record a few dozen frames the stars move enough to turn into streaks when the frames are merged. Here is an example:
The North Star above Mount Burdell, single 2-second exposure. (full resolution image here).
Mean of 32 2-second exposures (full resolution image here).
Seeing streaks instead of pinpoint stars in the sky can be avoided by shifting and rotating the original frames such that the stars align. Merging the aligned frames produces an image with a clean-looking sky, and many faint stars that were hidden by noise in the individual frames become visible. Of course, the ground is now motion-blurred as if the camera had followed the rotation of the sky.
Mean of 32 2-second exposures, stars aligned (full resolution image here).
We now have two images; one where the ground is sharp, and one where the sky is sharp, and we can combine them into a single picture that is sharp everywhere. In Photoshop the easiest way to do that is with a hand-painted layer mask. After adjusting brightness and colors to taste, slight cropping, and removing an ugly "No Trespassing" sign we get a presentable picture:
The North Star above Mount Burdell, shot with Google Pixel, final image (full resolution image here).
Using Even Less Light The pictures I've shown so far were shot on nights with a full moon, when it was bright enough that one could easily walk outside without a lantern or a flashlight. I wanted to find out if it was possible to take cellphone photos in even less light. Using a Pixel phone, I tried a scene illuminated by a three-quarter moon low in the sky, and another one with no moon at all. Anticipating more noise in the individual exposures, I shot 64-frame bursts. The processed final images still look fine:
Wrecked fishing boat in Inverness and the Big Dipper, 64 2-second exposures, shot with Google Pixel (full resolution image here).
Stars above Pierce Point Ranch, 64 2-second exposures, shot with Google Pixel (full resolution image here).
In the second image the distant lights of the cities around the San Francisco Bay caused the sky near the horizon to glow, but without moonlight the night was still dark enough to make the Milky Way visible. The picture looks noticeably grainier than my earlier moonlight shots, but it's not too bad.
Pushing the Limits How far can we go? Can we take a cellphone photo with only starlight - no moon, no artificial light sources nearby, and no background glow from a distant city?
To test this I drove to a point on the California coast a little north of the mouth of the Russian River, where nights can get really dark, and pointed my Pixel phone at the summer sky above the ocean. Combining 64 two-second exposures taken at ISO 12800, and 64 corresponding black black frames did produce a recognizable image of the Milky Way. The constellations Scorpius and Sagittarius are clearly visible, and squinting hard enough one can just barely make out the horizon and one or two rocks in the ocean, but overall, this is not a picture you'd want to print out and frame. Still, this may be the lowest-light cellphone photo ever taken.
Only starlight, shot with Google Pixel (full resolution image here).
Here we are approaching the limits of what the Pixel camera can do. The camera cannot handle exposure times longer than two seconds. If this restriction was removed we could expose individual frames for eight to ten seconds, and the stars still would not show noticeable motion blur. With longer exposures we could lower the ISO setting, which would significantly reduce noise in the individual frames, and we would get a correspondingly cleaner and more detailed final picture.
Getting back to the original challenge - using a cellphone to reproduce a night-time DSLR shot of the Golden Gate - I did that. Here is what I got:
Golden Gate Bridge at night, shot with Google Nexus 6P (full resolution image here).
The Moon above San Francisco, shot with Google Nexus 6P (full resolution image here).
At 9 to 10 MPixels the resolution of these pictures is not as high as what a DSLR camera might produce, but otherwise image quality is surprisingly good: the photos are sharp all the way into the corners, there is not much visible noise, the captured dynamic range is sufficient to avoid saturating all but the brightest highlights, and the colors are pleasing.
Trying to find out if phone cameras might be suitable for outdoor nighttime photography was a fun experiment, and clearly the result is yes, they are. However, arriving at the final images required a lot of careful post-processing on a desktop computer, and the procedure is too cumbersome for all but the most dedicated cellphone photographers. However, with the right software a phone should be able to process the images internally, and if steps such as painting layer masks by hand can be eliminated, it might be possible to do point-and-shoot photography in very low light conditions. Almost - the cellphone would still have to rest on the ground or be mounted on a tripod.
At Google, we care about giving users the best possible online experience, both through our own services and products and by contributing new tools and industry standards for use by the online community. That’s why we’re excited to announce Guetzli, a new open source algorithm that creates high quality JPEG images with file sizes 35% smaller than currently available methods, enabling webmasters to create webpages that can load faster and use even less data.
Guetzli [guɛtsli] — cookie in Swiss German — is a JPEG encoder for digital images and web graphics that can enable faster online experiences by producing smaller JPEG files while still maintaining compatibility with existing browsers, image processing applications and the JPEG standard. From the practical viewpoint this is very similar to our Zopfli algorithm, which produces smaller PNG and gzip files without needing to introduce a new format, and different than the techniques used in RNN-based image compression, RAISR, and WebP, which all need client and ecosystem changes for compression gains at internet scale.
The visual quality of JPEG images is directly correlated to its multi-stage compression process: color space transform, discrete cosine transform, and quantization. Guetzli specifically targets the quantization stage in which the more visual quality loss is introduced, the smaller the resulting file. Guetzli strikes a balance between minimal loss and file size by employing a search algorithm that tries to overcome the difference between the psychovisual modeling of JPEG's format, and Guetzli’s psychovisual model, which approximates color perception and visual masking in a more thorough and detailed way than what is achievable by simpler color transforms and the discrete cosine transform. However, while Guetzli creates smaller image file sizes, the tradeoff is that these search algorithms take significantly longer to create compressed images than currently available methods.
Figure 1. 16x16 pixel synthetic example of a phone line hanging against a blue sky — traditionally a case where JPEG compression algorithms suffer from artifacts. Uncompressed original is on the left. Guetzli (on the right) shows less ringing artefacts than libjpeg (middle) and has a smaller file size.
And while Guetzli produces smaller image file sizes without sacrificing quality, we additionally found that in experiments where compressed image file sizes are kept constant that human raters consistently preferred the images Guetzli produced over libjpeg images, even when the libjpeg files were the same size or even slightly larger. We think this makes the slower compression a worthy tradeoff.
Figure 2. 20x24 pixel zoomed areas from a picture of a cat’s eye. Uncompressed original on the left. Guetzli (on the right) shows less ringing artefacts than libjpeg (middle) without requiring a larger file size.
It is our hope that webmasters and graphic designers will find Guetzli useful and apply it to their photographic content, making users’ experience smoother on image-heavy websites in addition to reducing load times and bandwidth costs for mobile users. Last, we hope that the new explicitly psychovisual approach in Guetzli will inspire further image and video compression research.
Posted by Nick Johnston and David Minnen, Software Engineers
Data compression is used nearly everywhere on the internet - the videos you watch online, the images you share, the music you listen to, even the blog you're reading right now. Compression techniques make sharing the content you want quick and efficient. Without data compression, the time and bandwidth costs for getting the information you need, when you need it, would be exorbitant!
We introduce an architecture that uses a new variant of the Gated Recurrent Unit (a type of RNN that allows units to save activations and process sequences) called Residual Gated Recurrent Unit (Residual GRU). Our Residual GRU combines existing GRUs with the residual connections introduced in "Deep Residual Learning for Image Recognition" to achieve significant image quality gains for a given compression rate. Instead of using a DCT to generate a new bit representation like many compression schemes in use today, we train two sets of neural networks - one to create the codes from the image (encoder) and another to create the image from the codes (decoder).
Our system works by iteratively refining a reconstruction of the original image, with both the encoder and decoder using Residual GRU layers so that additional information can pass from one iteration to the next. Each iteration adds more bits to the encoding, which allows for a higher quality reconstruction. Conceptually, the network operates as follows:
The initial residual, R, corresponds to the original image I: R = I.
Set i=1 for to the first iteration.
Iteration[i] takes R[i-1] as input and runs the encoder and binarizer to compress the image into B[i].
Iteration[i] runs the decoder on B[i] to generate a reconstructed image P[i].
The residual for Iteration[i] is calculated: R[i] = I - P[i].
Set i=i+1 and go to Step 3 (up to the desired number of iterations).
The residual image represents how different the current version of the compressed image is from the original. This image is then given as input to the network with the goal of removing the compression errors from the next version of the compressed image. The compressed image is now represented by the concatenation of B through B[N]. For larger values of N, the decoder gets more information on how to reduce the errors and generate a higher quality reconstruction of the original image.
To understand how this works, consider the following example of the first two iterations of the image compression network, shown in the figures below. We start with an image of a lighthouse. On the first pass through the network, the original image is given as an input (R = I). P is the reconstructed image. The difference between the original image and encoded image is the residual, R, which represents the error in the compression.
Left: Original image, I = R. Center: Reconstructed image, P. Right: the residual, R, which represents the error introduced by compression.
On the second pass through the network, R is given as the network’s input (see figure below). A higher quality image P is then created. So how does the system recreate such a good image (P, center panel below) from the residual R? Because the model uses recurrent nodes with memory, the network saves information from each iteration that it can use in the next one. It learned something about the original image in Iteration that is used along with R to generate a better P from B. Lastly, a new residual, R (right), is generated by subtracting P from the original image. This time the residual is smaller since there are fewer differences between the reconstructed image, and what we started with.
The second pass through the network. Left: R is given as input. Center: A higher quality reconstruction, P. Right: A smaller residual R is generated by subtracting P from the original image.
At each further iteration, the network gains more information about the errors introduced by compression (which is captured by the residual image). If it can use that information to predict the residuals even a little bit, the result is a better reconstruction. Our models are able to make use of the extra bits up to a point. We see diminishing returns, and at some point the representational power of the network is exhausted.
To demonstrate file size and quality differences, we can take a photo of Vash, a Japanese Chin, and generate two compressed images, one JPEG and one Residual GRU. Both images target a perceptual similarity of 0.9 MS-SSIM, a perceptual quality metric that reaches 1.0 for identical images. The image generated by our learned model results in an file 25% smaller than JPEG.
Left: Original image (1419 KB PNG) at ~1.0 MS-SSIM. Center: JPEG (33 KB) at ~0.9 MS-SSIM. Right: Residual GRU (24 KB) at ~0.9 MS-SSIM. This is 25% smaller for a comparable image quality
Taking a look around his nose and mouth, we see that our method doesn’t have the magenta blocks and noise in the middle of the image as seen in JPEG. This is due to the blocking artifacts produced by JPEG, whereas our compression network works on the entire image at once. However, there's a tradeoff -- in our model the details of the whiskers and texture are lost, but the system shows great promise in reducing artifacts.
While today’s commonly used codecs perform well, our work shows that using neural networks to compress images results in a compression scheme with higher quality and smaller file sizes. To learn more about the details of our research and a comparison of other recurrent architectures, check out our paper. Our future work will focus on even better compression quality and faster models, so stay tuned!
RenderScript has a very powerful ability called Intrinsics. Intrinsics are built-in functions that perform well-defined operations often seen in image processing. Intrinsics can be very helpful to you because they provide extremely high-performance implementations of standard functions with a minimal amount of code.
RenderScript intrinsics will usually be the fastest possible way for a developer to perform these operations. We’ve worked closely with our partners to ensure that the intrinsics perform as fast as possible on their architectures — often far beyond anything that can be achieved in a general-purpose language.
Table 1. RenderScript intrinsics and the operations they provide.
This example creates a RenderScript context and a Blur intrinsic. It then uses the intrinsic to perform a Gaussian blur with a 25-pixel radius on the allocation. The default implementation of blur uses carefully hand-tuned assembly code, but on some hardware it will instead use hand-tuned GPU code.
What do developers get from the tuning that we’ve done? On the new Nexus 7, running that same 25-pixel radius Gaussian blur on a 1.6 megapixel image takes about 176ms. A simpler intrinsic like the color matrix operation takes under 4ms. The intrinsics are typically 2-3x faster than a multithreaded C implementation and often 10x+ faster than a Java implementation. Pretty good for eight lines of code.
Figure 1. Performance gains with RenderScript intrinsics, relative to equivalent multithreaded C implementations.
Applications that need additional functionality can mix these intrinsics with their own RenderScript kernels. An example of this would be an application that is taking camera preview data, converting it from YUV to RGB, adding a vignette effect, and uploading the final image to a SurfaceView for display.
In this example, we’ve got a stream of data flowing between a source device (the camera) and an output device (the display) with a number of possible processors along the way. Today, these operations can all run on the CPU, but as architectures become more advanced, using other processors becomes possible.
For example, the vignette operation can happen on a compute-capable GPU (like the ARM Mali T604 in the Nexus 10), while the YUV to RGB conversion could happen directly on the camera’s image signal processor (ISP). Using these different processors could significantly improve power consumption and performance. As more these processors become available, future Android updates will enable RenderScript to run on these processors, and applications written for RenderScript today will begin to make use of those processors transparently, without any additional work for developers.
Intrinsics provide developers a powerful tool they can leverage with minimal effort to achieve great performance across a wide variety of hardware. They can be mixed and matched with general purpose developer code allowing great flexibility in application design. So next time you have performance issues with image manipulation, I hope you give them a look to see if they can help.