
Open sourcing the attention center model

When you look at an image, which parts do you pay attention to first? Could a machine learn this? We provide a machine learning model that does just that. Why is it useful? The latest-generation image format, JPEG XL, supports serving the parts you pay attention to first, which improves the user experience: images appear to load faster. And the model is not limited to encoding JPEG XL images; it can be used whenever we need to know where a human would look first.

An open source attention center model

What regions in an image will attract the majority of human visual attention first? We trained a model, called the attention center model, to predict such a region for a given image; it is now open sourced. In addition to the model, we provide a script to use it in combination with the JPEG XL encoder: google/attention-center.

Some example predictions of our attention center model are shown in the following figure, where the green dot is the predicted attention center point for the image. Note that in the “two parrots” image both parrots’ heads are visually important, so the attention center point will be in the middle.

Four example images: a red door with a brass doorknob (top left); a smiling girl with a painted face, ribbons in her hair, and a colorful sweater (top right); a teal-shuttered cathedral-style window in a sand-colored stucco wall with pink and red hibiscus in the foreground (bottom left); a blue-and-yellow macaw and a red-and-green macaw side by side (bottom right).
Images are from Kodak image data set: http://r0k.us/graphics/kodak/

The model is 2MB and in the TensorFlow Lite format. It takes an RGB image as input and outputs a 2D point, which is the predicted center of human attention on the image. That predicted center is the place where we should start with operations (decoding and displaying, in the JPEG XL case). This allows the most visually salient/important regions to be processed as early as possible. Check out the code and continue to build upon it!
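As a minimal sketch of how such a TensorFlow Lite model can be queried, the snippet below loads a model file and runs it on one image. The file names, input size, normalization, and output layout are illustrative assumptions; consult the google/attention-center repository for the exact interface.

```python
# Minimal sketch: running a TFLite attention-center model on one image.
# The model path, expected pre-processing, and output layout are
# assumptions; see the google/attention-center repo for actual details.
import numpy as np
import tensorflow as tf
from PIL import Image

interpreter = tf.lite.Interpreter(model_path="attention_center.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Resize the RGB image to whatever input shape the model expects.
_, height, width, _ = input_details["shape"]
image = Image.open("example.jpg").convert("RGB").resize((width, height))
pixels = np.expand_dims(np.asarray(image, dtype=np.float32) / 255.0, axis=0)

interpreter.set_tensor(input_details["index"], pixels)
interpreter.invoke()
# Assumed output: a single (x, y) point.
center_x, center_y = interpreter.get_tensor(output_details["index"])[0]
print(f"Predicted attention center: ({center_x:.1f}, {center_y:.1f})")
```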

Attention center ground-truth data

To train a model to predict the attention center, we first need ground-truth data for it. Given an image, attention points can either be collected by eye trackers [1], or approximated by mouse clicks on a blurry version of the image [2]. We first apply temporal filtering to those attention points and keep only the initial ones, then apply spatial filtering to remove noise (e.g., random gazes). We then compute the center of the remaining attention points as the ground-truth attention center. The figure below illustrates this process.

Five images in a row showing a person standing on a rock by the ocean: the original image; the gaze/attention points; the result of temporal filtering; the result of spatial filtering; and the attention center.
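As an illustrative sketch of this kind of pipeline (not the exact filtering used for our data set), the function below keeps the earliest gaze points, drops spatial outliers, and averages what remains; the time threshold and outlier rule are assumptions.

```python
# Hypothetical sketch of deriving a ground-truth attention center from
# gaze points (x, y, t). The thresholds are illustrative assumptions,
# not the values used for the actual data set.
import numpy as np

def attention_center(points, max_time=0.5, max_sigma=2.0):
    """points: array of shape (N, 3) with columns x, y, timestamp."""
    pts = np.asarray(points, dtype=np.float64)
    # Temporal filtering: keep only the earliest attention points.
    early = pts[pts[:, 2] - pts[:, 2].min() <= max_time, :2]
    # Spatial filtering: drop outliers far from the preliminary mean.
    mean = early.mean(axis=0)
    dist = np.linalg.norm(early - mean, axis=1)
    kept = early[dist <= max_sigma * dist.std()] if dist.std() > 0 else early
    # The ground truth is the center of the remaining points.
    return kept.mean(axis=0)
```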

Attention center model architecture

The attention center model is a deep neural net that takes an image as input and uses a pre-trained classification network, e.g., ResNet or MobileNet, as the backbone. Several intermediate layers output by the backbone network are used as input to the attention center prediction module. These intermediate layers contain different information: shallow layers often contain low-level information like intensity, color, and texture, while deeper layers usually contain higher-level, more semantic information like shapes and objects. All of it is useful for attention prediction. The attention center prediction module applies convolution, deconvolution, and/or resizing operators, together with aggregation and a sigmoid function, to generate a weighting map for the attention center. An operator (the Einstein summation operator in our case) is then applied to compute the (gravity) center from the weighting map. The L2 norm between the predicted attention center and the ground-truth attention center is used as the training loss.

Attention center model architecture
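To make the center computation concrete, here is a small sketch of extracting a gravity center from a weighting map with Einstein summation; the map below is random stand-in data, not a real model output.

```python
# Sketch: computing the (gravity) center of a weighting map via einsum.
# The map here is random stand-in data; in the model it comes from the
# attention center prediction module.
import numpy as np

h, w = 32, 32
weights = np.random.rand(h, w)
weights /= weights.sum()  # normalize so the weights form a distribution

ys = np.arange(h, dtype=np.float64)
xs = np.arange(w, dtype=np.float64)
# Weighted average of the row and column coordinates.
center_y = np.einsum("hw,h->", weights, ys)
center_x = np.einsum("hw,w->", weights, xs)
print(f"Gravity center: ({center_x:.2f}, {center_y:.2f})")
```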

Progressive JPEG XL images with attention center model

JPEG XL is a new image format that allows the user to encode images so that the more interesting parts arrive first. This has the advantage that, when viewing images transferred over the web, we can display the attention-grabbing part of the image, i.e., the part the user looks at first, right away; by the time the user looks elsewhere, the rest of the image has ideally already arrived and been decoded. Our earlier post, Using Saliency in progressive JPEG XL images, illustrates how this works in principle. In short, in JPEG XL the image is divided into square groups (typically of size 256 x 256), and the JPEG XL encoder chooses a starting group in the image and then grows concentric squares around that group. It was this need for figuring out where the attention center of an image is that led us to open source the attention center model, together with a script to use it in combination with the JPEG XL encoder. Progressive decoding of JPEG XL images has recently been added to Chrome, starting from version 107. At the moment, JPEG XL is behind an experimental flag, which can be enabled by going to chrome://flags and searching for "jxl".
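As a hedged sketch of how the pieces could be wired together, the snippet below feeds a predicted center into the cjxl encoder via a subprocess call. The flag names for center-first group ordering (--group_order, --center_x, --center_y) reflect our understanding of the libjxl CLI; the script in google/attention-center handles this for you and is the authoritative reference.

```python
# Sketch: encoding a JPEG XL image whose progressive order starts at a
# predicted attention center. Flag names reflect our understanding of
# the libjxl cjxl CLI; the encode script in google/attention-center is
# the authoritative way to do this.
import subprocess

def encode_center_first(src, dst, center_x, center_y):
    subprocess.run(
        [
            "cjxl", src, dst,
            "--group_order=1",              # center-first group ordering
            f"--center_x={int(center_x)}",  # horizontal center, in pixels
            f"--center_y={int(center_y)}",  # vertical center, in pixels
        ],
        check=True,
    )

encode_center_first("example.jpg", "example.jxl", 310, 175)
```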

To try out how partially loaded progressive JPEG XL images look, you can go to https://google.github.io/attention-center/.

By Moritz Firsching, Junfeng He, and Zoltan Szabadka – Google Research

References

[1] Valliappan, Nachiappan, Na Dai, Ethan Steinberg, Junfeng He, Kantwon Rogers, Venky Ramachandran, Pingmei Xu et al. "Accelerating eye movement research via accurate and affordable smartphone eye tracking." Nature Communications 11, no. 1 (2020): 1-12.

[2] Jiang, Ming, Shengsheng Huang, Juanyong Duan, and Qi Zhao. "SALICON: Saliency in Context." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1072-1080. 2015.

Using Saliency in progressive JPEG XL images

At Google, we are working towards improving the web experience for users. Getting images delivered fast is a crucial part of the web experience, and progressive images can help by delivering the salient parts, detected by machine learning, first. When you look at an image, you don't immediately take in the entire image, but tend to gaze at the most interesting, or "salient", parts first. When delivering images over the web, it is now possible to organize the data so that the most salient parts arrive first. Ideally you don't even notice that some less salient parts have not yet arrived, because by the time you look at those parts they have already arrived and been rendered.

We will explain how this works with the new open source image format JPEG XL, but we’ll start by taking a step back and describing how images are currently delivered and rendered on the web.

How partial images are displayed on the web

It's important that websites, including their images, load quickly, because waiting for images to load causes frustration. Two techniques in particular are used to make images appear fast: one is showing an approximation of the image before all bytes of the image are transmitted, often known as "progressive image loading." Another is making the byte size of the image smaller by using strong image compression.

What is progressive image loading?

Some image formats are implemented in a way that does not allow any kind of progressive image loading; all the bytes of the image have to be received before rendering can begin. The simplest type of image loading beyond that is sometimes called "sequential image loading." For these images, the data is organized so that pixels arrive in a particular order, typically in rows from top to bottom.

Formats with this kind of image loading include PNG, WebP, and JPEG. The JPEG format also allows more sophisticated forms of progressive images: the data can be organized to arrive in multiple scans, with each scan showing more detail than the previous one.

For example, even when only approximately 15% of the data for an image has loaded, the result often already looks decent. The following images compare no progression, sequential loading, and progressive loading:

100% of bytes loaded, original image

15% of bytes loaded, no progressive image loading

15% of bytes loaded, sequential image loading

15% of bytes loaded, progressive JPEG

In the first scan, the progressive JPEG has only a small amount of information available for the image, e.g., only the average color of each 8x8 block. This is known as the DC-only scan, because the average color of each 8x8 block is the DC component of the discrete cosine transform, which is the basis of JPEG image compression. Check out this Computerphile video on the JPEG DCT for a basic introduction. Instead of displaying an image that consists of 8x8 blocks, JPEG rendering in Chrome and Firefox applies some smoothing to the preview, to provide a less distracting experience.
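To get a feel for what a DC-only scan contains, here is a small sketch that reduces an image to its 8x8 block averages as a stand-in for the DC components; the file names are placeholders, and this is not a JPEG decoder.

```python
# Sketch: approximating a DC-only preview by averaging each 8x8 block.
# This mimics what the DC components encode; it is not a JPEG decoder.
import numpy as np
from PIL import Image

image = np.asarray(Image.open("example.jpg").convert("RGB"), dtype=np.float64)
h, w, _ = image.shape
h8, w8 = h - h % 8, w - w % 8  # crop to a multiple of the block size

# Average over each 8x8 block: one value per block and channel.
blocks = image[:h8, :w8].reshape(h8 // 8, 8, w8 // 8, 8, 3)
dc = blocks.mean(axis=(1, 3)).astype(np.uint8)

# Upsample with smoothing, roughly like browser previews do.
Image.fromarray(dc).resize((w8, h8), Image.BILINEAR).save("dc_preview.png")
```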

Progressive JPEG XL

While the quality (and therefore the byte size) of the individual scans in a progressive JPEG can be controlled, the order within a scan is still top to bottom, as in a sequential JPEG. JPEG XL goes beyond that by making it possible to send the data needed to display all details of the most salient parts first, followed by the less salient parts. For example, in a portrait we can decide to first send the bytes for the face, and only then those for the out-of-focus background.

In general, progressive JPEG XL works in the following way:
  • There is always an 8x8 downsampled image available (similar to a DC-only scan in a progressive JPEG). The decoder can display that with a nice upsampling, which gives the impression of a smoothed version of the image.
  • The image is divided into square groups (typically of size 256 x 256) and it is possible to provide an order of these groups during encoding. In particular, we can order the groups by saliency and choose an order that anticipates where the viewer might look first, while not being disturbing.

While the format allows for a very flexible order of the groups, our current encoder chooses a starting group and then grows concentric squares around it, because we expect this to be less distracting to the user. To make successive updates even less noticeable, we smooth the boundary between groups for which all the data has arrived and those that still contain an incomplete approximation. One requirement of this technique is a good way of identifying where the salient parts of an image are when encoding an image. This information is typically represented by a saliency map, which can be visualized as a heatmap image in which the more salient parts are redder.
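As a sketch of the concentric-square ordering just described (our illustration, not the libjxl implementation), groups can be sorted by their Chebyshev distance from the starting group, so each "ring" around it comes out together:

```python
# Sketch: ordering 256x256 groups in concentric squares around a start
# group, as described above. Illustrative only, not the libjxl encoder.
def concentric_group_order(cols, rows, start_col, start_row):
    groups = [(c, r) for r in range(rows) for c in range(cols)]
    # Chebyshev distance grows one ring (concentric square) at a time.
    return sorted(groups, key=lambda g: max(abs(g[0] - start_col),
                                            abs(g[1] - start_row)))

# A 1024x768 image has 4x3 groups of 256x256; start near the center.
order = concentric_group_order(cols=4, rows=3, start_col=1, start_row=1)
print(order[:5])  # the start group first, then its ring of neighbors
```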

Original image (left) and saliency map (right).

Smooth DC-image (left) and image with group order (right).

Stay tuned for videos showing progressive JPEG XL in action.

How to find good saliency maps for images

Saliency prediction models (overview) aim to predict which regions in an image will attract human attention. To predict saliency effectively, our model leverages deep neural nets to consider both high-level semantic signals like faces, objects, and shapes, and low- or medium-level signals like color, intensity, and texture. The model is trained on a large-scale public gaze/saliency data set, so that the predicted saliency closely mimics human gaze/fixation behavior on each image. The model takes an image as input and outputs a saliency map, which can serve as a visual importance map and hence help determine the decoding order for each region in the image. Example images and their predicted saliency are shown below:

Example images and their predicted saliency
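One simple way to turn such a saliency map into a group decoding order (a sketch of ours, not necessarily what the encoder does) is to rank each 256x256 group by its mean saliency:

```python
# Sketch: deriving a group decoding order from a saliency map by ranking
# groups by mean saliency. Illustrative only; the encoder may differ.
import numpy as np

def saliency_group_order(saliency, group=256):
    h, w = saliency.shape
    scored = []
    for r in range(0, h, group):
        for c in range(0, w, group):
            score = saliency[r:r + group, c:c + group].mean()
            scored.append((score, (c // group, r // group)))
    # Most salient groups first.
    return [g for _, g in sorted(scored, reverse=True)]

# Random stand-in for a predicted saliency map of a 1024x768 image.
saliency_map = np.random.rand(768, 1024)
print(saliency_group_order(saliency_map)[:4])
```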

At the time of writing (July 2021), Chrome and Firefox did not yet support decoding JPEG XL images progressively in the way we describe, but the spec does allow encoding arbitrary group orders.

Different users have different experiences when it comes to looking at images loading on the web. We hope that this way of progressively delivering images will improve the user experience, especially on lower-bandwidth connections.

By Moritz Firsching and Junfeng He – Google Research
