Category Archives: Google Testing Blog

If it ain’t broke, you’re not trying hard enough

Test Flakiness – One of the main challenges of automated testing (Part II)

By George Pirocanac

This is part two of a series on test flakiness. The first article discussed the four components under which tests are run and the possible reasons for test flakiness. This article will discuss the triage tips and remedies for flakiness for each of these possible reasons.


Components


To review, the four components where flakiness can occur include:
  • The tests themselves
  • The test-running framework
  • The application or system under test (SUT) and the services and libraries that the SUT and testing framework depend upon
  • The OS, hardware, and network that the SUT and testing framework depend upon

This was captured and summarized in the following diagram.

The reasons, triage tips, and remedies for flakiness are discussed below, by component.



The tests themselves


The tests themselves can introduce flakiness. This can include test data, test workflows, initial setup of test prerequisites, and initial state of other dependencies.


  • Reason: Improper initialization or cleanup.
    Triage: Look for compiler warnings about uninitialized variables. Inspect initialization and cleanup code. Check that the environment is set up and torn down correctly. Verify that test data is correct.
    Remedy: Explicitly initialize all variables with proper values before their use. Properly set up and tear down the testing environment. Consider an initial test that verifies the state of the environment.

  • Reason: Invalid assumptions about the state of test data.
    Triage: Rerun test(s) independently.
    Remedy: Make tests independent of any state from other tests and previous runs.

  • Reason: Invalid assumptions about the state of the system, such as the system time.
    Triage: Explicitly check for system dependency assumptions.
    Remedy: Remove or isolate the SUT dependencies on aspects of the environment that you do not control.

  • Reason: Dependencies on execution time, expecting asynchronous events to occur in a specific order, waiting without timeouts, or race conditions between the tests and the application.
    Triage: Log the times when accesses to the application are made. As part of debugging, introduce delays in the application to check for differences in test results.
    Remedy: Add synchronization elements to the tests so that they wait for specific application states (see the sketch after this table). Disable unnecessary caching to have a predictable timeline for the application responses. Note: Do NOT add arbitrary delays, as these can become flaky again over time and slow down the test unnecessarily.

  • Reason: Dependencies on the order in which the tests are run. (Similar to the second case above.)
    Triage: Rerun test(s) independently.
    Remedy: Make tests independent of each other and of any state from previous runs.


Table 1 - Reasons, triage tips, and remedies for flakiness in the tests themselves
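As an illustration of the synchronization remedy in the timing row above, a test can poll for a specific application state with a bounded timeout instead of sleeping for an arbitrary amount of time. This is a hedged sketch, not code from the original article; the isReady() call stands in for whatever state check your application exposes.

import java.time.Duration;
import java.time.Instant;
import java.util.function.BooleanSupplier;

final class WaitUtil {
  // Polls the condition until it becomes true, failing once the timeout is exceeded.
  static void waitFor(BooleanSupplier condition, Duration timeout) throws InterruptedException {
    Instant deadline = Instant.now().plus(timeout);
    while (!condition.getAsBoolean()) {
      if (Instant.now().isAfter(deadline)) {
        throw new AssertionError("Condition not met within " + timeout);
      }
      Thread.sleep(50);  // short poll interval, bounded by the overall timeout
    }
  }
}

// Usage in a test: wait for a specific application state rather than sleeping blindly.
// WaitUtil.waitFor(() -> app.isReady(), Duration.ofSeconds(10));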

The test-running framework


An unreliable test-running framework can introduce flakiness. 


  • Reason: Failure to allocate enough resources for the SUT, thus preventing it from running.
    Triage: Check logs to see if SUT came up.
    Remedy: Allocate sufficient resources.

  • Reason: Improper scheduling of the tests so they “collide” and cause each other to fail.
    Triage: Explicitly run tests independently in different order.
    Remedy: Make tests runnable independently of each other.

  • Reason: Insufficient system resources to satisfy the test requirements. (Similar to the first case but here resources are consumed while running the workflow.)
    Triage: Check system logs to see if SUT ran out of resources.
    Remedy: Fix memory leaks or similar resource “bleeding.” Allocate sufficient resources to run tests.


Table 2 - Reasons, triage tips, and remedies for flakiness in the test-running framework


The application or SUT and the services and libraries that the SUT and testing framework depend upon


Of course, the application itself (or the SUT) could be the source of flakiness. 
An application can also have numerous dependencies on other services, and each of those services can have their own dependencies. In this chain, each of the services can introduce flakiness. 


  • Reason: Race conditions.
    Triage: Log accesses of shared resources.
    Remedy: Add synchronization elements to the tests so that they wait for specific application states (an SUT-side sketch of removing a race follows this table). Note: Do NOT add arbitrary delays, as these can become flaky again over time.

  • Reason: Uninitialized variables.
    Triage: Look for compiler warnings about uninitialized variables.
    Remedy: Explicitly initialize all variables with proper values before their use.

  • Reason: Being slow to respond or being unresponsive to the stimuli from the tests.
    Triage: Log the times when requests and responses are made.
    Remedy: Check for and remove any causes of delays.

  • Reason: Memory leaks.
    Triage: Look at memory consumption during test runs. Use tools such as Valgrind to detect them.
    Remedy: Fix the programming error causing the memory leak. The Wikipedia article on memory leaks has an excellent discussion of these types of errors.

  • Reason: Oversubscription of resources.
    Triage: Check system logs to see if SUT ran out of resources.
    Remedy: Allocate sufficient resources to run tests.

  • Reason: Changes to the application (or dependent services) out of sync with the corresponding tests.
    Triage: Examine revision history.
    Remedy: Institute a policy requiring code changes to be accompanied by tests.


Table 3 - Reasons, triage tips, and remedies for flakiness in the application or SUT
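As an SUT-side illustration of the race-condition row above (a sketch, not code from the original article; the ReservationStore class and its fields are hypothetical), a check-then-act sequence on a shared resource can be made atomic so that concurrent requests cannot interleave, and each access can be logged to help triage.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Logger;

final class ReservationStore {
  private static final Logger logger = Logger.getLogger(ReservationStore.class.getName());
  private final Map<String, String> reservations = new ConcurrentHashMap<>();

  // putIfAbsent makes the check-and-insert atomic; a separate get() followed by put()
  // would let two concurrent requests both appear to succeed.
  boolean reserve(String itemId, String userId) {
    logger.info("reserve access: item=" + itemId + " user=" + userId);  // log shared-resource access
    return reservations.putIfAbsent(itemId, userId) == null;
  }
}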


The OS and hardware that the SUT and testing framework depend upon


Finally, the underlying hardware and operating system can be sources of test flakiness. 


  • Reason: Networking failures or instability.
    Triage: Check for hardware errors in system logs.
    Remedy: Fix hardware errors or run tests on different hardware.

  • Reason: Disk errors.
    Triage: Check for hardware errors in system logs.
    Remedy: Fix hardware errors or run tests on different hardware.

  • Reason: Resources being consumed by other tasks/services not related to the tests being run.
    Triage: Examine system process activity.
    Remedy: Reduce activity of other processes on test system(s).


Table 4 - Reasons, triage tips, and remedies for flakiness in the OS and hardware of the SUT


Conclusion

As can be seen from the wide variety of failures, having low flakiness in automated testing can be quite a challenge. This article has outlined both the components under which tests are run and the types of flakiness that can occur, and thus can serve as a cheat sheet when triaging and fixing flaky tests.



Test Flakiness – One of the main challenges of automated testing

By George Pirocanac


Dealing with test flakiness is a critical skill in testing because automated tests that do not provide a consistent signal will slow down the entire development process. If you haven’t encountered flaky tests, this article is a must-read as it first tries to systematically outline the causes for flaky tests. If you have encountered flaky tests, see how many fall into the areas listed.


A follow-up article will talk about dealing with each of the causes.


Over the years I’ve seen a lot of reasons for flaky tests, but rather than review them one by one, let’s group the sources of flakiness by the components under which tests are run:
  • The tests themselves
  • The test-running framework
  • The application or system under test (SUT) and the services and libraries that the SUT and testing framework depend upon
  • The OS and hardware that the SUT and testing framework depend upon

This is illustrated below. Figure 1 first shows the hardware/software stack that supports an application or system under test. At the lowest level is the hardware. The next level up is the operating system, followed by the libraries that provide an interface to the system. At the highest level is the middleware, the layer that provides application-specific interfaces.



In a distributed system, however, each of the services of the application and the services it depends upon can reside on a different hardware/software stack, as can the test-running service. This is illustrated in Figure 2 as the full test-running environment.




As discussed above, each of these components is a potential area for flakiness.


The tests themselves


The tests themselves can introduce flakiness. Typical causes include:
  • Improper initialization or cleanup.
  • Invalid assumptions about the state of test data.
  • Invalid assumptions about the state of the system. An example can be the system time.
  • Dependencies on the timing of the application.
  • Dependencies on the order in which the tests are run. (Similar to the first case above.)


The test-running framework


An unreliable test-running framework can introduce flakiness. Typical causes include:

  • Failure to allocate enough resources for the system under test, thus causing it to fail to come up.
  • Improper scheduling of the tests so they “collide” and cause each other to fail.
  • Insufficient system resources to satisfy the test requirements.

The application or system under test and the services and libraries that the SUT and testing framework depend upon


Of course, the application itself (or the system under test) could be the source of flakiness. An application can also have numerous dependencies on other services, and each of those services can have their own dependencies. In this chain, each of the services can introduce flakiness. Typical causes include:
  • Race conditions.
  • Uninitialized variables.
  • Being slow to respond or being unresponsive to the stimuli from the tests.
  • Memory leaks.
  • Oversubscription of resources.
  • Changes to the application (or dependent services) happening at a different pace than those to the corresponding tests.

Testing environments are called hermetic when they contain everything that is needed to run the tests (i.e. no external dependencies like servers running in production). Hermetic environments, in general, are less likely to be flaky.

The OS and hardware that the SUT and testing framework depend upon



Finally, the underlying hardware and operating system can be the source of test flakiness. Typical causes include:
  • Networking failures or instability.
  • Disk errors.
  • Resources being consumed by other tasks/services not related to the tests being run.

As can be seen from the wide variety of failures, having low flakiness in automated testing can be quite a challenge. This article has outlined both the areas and the types of flakiness that can occur in those areas, so it can serve as a cheat sheet when triaging flaky tests.


In the follow-up to this post we’ll look at ways of addressing these issues.



Testing on the Toilet: Separation of Concerns? That’s a Wrap!

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.


By Stefan Kennedy


The following function decodes a byte array as an image using an API named SpeedyImg. What maintenance problems might arise due to referencing an API owned by a different team?

SpeedyImgImage decodeImage(List<SpeedyImgDecoder> decoders, byte[] data) {
  SpeedyImgOptions options = getDefaultConvertOptions();
  for (SpeedyImgDecoder decoder : decoders) {
    SpeedyImgResult decodeResult = decoder.decode(decoder.formatBytes(data));
    SpeedyImgImage image = decodeResult.getImage(options);
    if (validateGoodImage(image)) { return image; }
  }
  throw new RuntimeException();
}



Details about how to call the API are mixed with domain logic, which can make the code harder to understand. For example, the call to decoder.formatBytes() is required by the API, but how the bytes are formatted isn’t relevant to the domain logic.


Additionally, if this API is used in many places across a codebase, then all usages may need to change if the way the API is used changes. For example, if the return type of this function is changed to the more generic SpeedyImgResult type, usages of SpeedyImgImage would need to be updated.


To avoid these maintenance problems, create wrapper types to hide API details behind an abstraction:

Image decodeImage(List<ImageDecoder> decoders, byte[] data) {
  for (ImageDecoder decoder : decoders) {
    Image decodedImage = decoder.decode(data);
    if (validateGoodImage(decodedImage)) { return decodedImage; }
  }
  throw new RuntimeException();
}
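The wrapper types themselves are not shown in the episode. A minimal sketch of what they could look like, reusing the hypothetical SpeedyImg types and getDefaultConvertOptions() from the first snippet, is:

// Domain-level decoder interface; implementations hide a specific image API.
interface ImageDecoder {
  Image decode(byte[] data);
}

// Domain-level image type that keeps SpeedyImgImage out of the rest of the codebase.
class Image {
  private final SpeedyImgImage speedyImage;
  Image(SpeedyImgImage speedyImage) { this.speedyImage = speedyImage; }
}

// Adapter that confines all SpeedyImg-specific details to one place.
class SpeedyImgImageDecoder implements ImageDecoder {
  private final SpeedyImgDecoder decoder;
  private final SpeedyImgOptions options = getDefaultConvertOptions();

  SpeedyImgImageDecoder(SpeedyImgDecoder decoder) { this.decoder = decoder; }

  @Override
  public Image decode(byte[] data) {
    // formatBytes() and the convert options are API details callers no longer see.
    SpeedyImgResult result = decoder.decode(decoder.formatBytes(data));
    return new Image(result.getImage(options));
  }
}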


Wrapping an external API follows the Separation of Concerns principle, since the logic for how the API is called is separated from the domain logic. This has many benefits, including:
  • If the way the API is used changes, encapsulating the API in a wrapper limits how far those changes can propagate across your codebase.
  • You can modify the interface or the implementation of types you own, but you can’t for API types.
  • It is easier to switch or add another API, since they can still be represented by the introduced types (e.g. ImageDecoder/Image).
  • Readability can improve as you don’t need to sift through API code to understand core logic.

Not all external APIs need to be wrapped. For example, if an API would take a huge effort to separate or is simple enough that it doesn't pollute the codebase, it may be better not to introduce wrapper types (e.g. library types like List in Java or std::vector in C++). When in doubt, keep in mind that a wrapper should only be added if it will clearly improve the code (see the YAGNI principle).


“Separation of Concerns” in the context of external APIs is also described by Martin Fowler in his blog post, Refactoring code that accesses external services.



Fixing a Test Hourglass

By Alan Myrvold


Automated tests make it safer and faster to create new features, fix bugs, and refactor code. When planning the automated tests, we envision a pyramid with a strong foundation of small unit tests, some well designed integration tests, and a few large end-to-end tests. As described in Just Say No to More End-to-End Tests, tests should be fast, reliable, and specific; end-to-end tests, however, are often slow, unreliable, and difficult to debug.


As software projects grow, often the shape of our test distribution becomes undesirable, either top heavy (no unit or medium integration tests), or like an hourglass.


The hourglass test distribution has a large set of unit tests, a large set of end-to-end tests, and few or no medium integration tests.


                      

To transform the hourglass back into a pyramid, so that you can test the integration of components in a reliable, sustainable way, you need to rethink how the system under test and the test infrastructure are architected. This typically requires both system testability improvements and test-code improvements.


I worked on a project with a web UI, a server, and many backends. There were unit tests at all levels with good coverage and a quickly increasing set of end-to-end tests.


The end-to-end tests found issues that the unit tests missed, but they ran slowly, and environmental issues caused spurious failures, including test data corruption. In addition, some functional areas were difficult to test because they covered more than a single unit and required state within the system that was hard to set up.




We eventually found a good test architecture for faster, more reliable integration tests, but with some missteps along the way.

An example UI-level end-to-end test, written in protractor, looked something like this:


describe('Terms of service are handled', () => {
  it('accepts terms of service', async () => {
    const user = getUser('termsNotAccepted');
    await login(user);
    await see(termsOfServiceDialog());
    await click('Accept');
    await logoff();
    await login(user);
    await not.see(termsOfServiceDialog());
  });
});


This test logs on as a user, sees the terms of service dialog that the user needs to accept, accepts it, then logs off and logs back on to ensure the user is not prompted again.


This terms of service test was a challenge to run reliably, because once an agreement was accepted, the backend server had no RPC method to reverse the operation and “un-accept” the TOS. We could create a new user with each test, but that was time consuming and hard to clean up.


The first attempt to make the terms of service feature testable without end-to-end testing was to hook the server RPC method and set the expectations within the test. The hook intercepts the RPC call and provides expected results instead of calling the backend API.




This approach worked. The test interacted with the backend RPC without really calling it, but it cluttered the test with extra logic.


describe('Terms of service are handled', () => {
  it('accepts terms of service', async () => {
    const user = getUser('someUser');
    await hook('TermsOfService.Get()', true);
    await login(user);
    await see(termsOfServiceDialog());
    await click('Accept');
    await logoff();
    await hook('TermsOfService.Get()', false);
    await login(user);
    await not.see(termsOfServiceDialog());
  });
});



The test met the goal of testing the integration of the web UI and server, but it was unreliable. As the system scaled under load, there were several server processes and no guarantee that the UI would access the same server for all RPC calls, so the hook might be set in one server process and the UI accessed in another. 


The hook also wasn't at a natural system boundary, which made it require more maintenance as the system evolved and code was refactored.


The next design of the test architecture was to fake the backend that eventually processes the terms of service call.


The fake implementation can be quite simple:

public class FakeTermsOfService implements TermsOfService.Service {
  private static final Map<String, Boolean> accepted = new ConcurrentHashMap<>();

  @Override
  public TosGetResponse get(TosGetRequest req) {
    return accepted.getOrDefault(req.UserID(), Boolean.FALSE);
  }

  @Override
  public void accept(TosAcceptRequest req) {
    accepted.put(req.UserID(), Boolean.TRUE);
  }
}



And the test is now uncluttered by the expectations:

describe('Terms of service are handled', () => {
  it('accepts terms of service', async () => {
    const user = getUser('termsNotAccepted');
    await login(user);
    await see(termsOfServiceDialog());
    await click('Accept');
    await logoff();
    await login(user);
    await not.see(termsOfServiceDialog());
  });
});


Because the fake stores the accepted state in memory, there is no need to reset the state for the next test iteration; it is enough just to restart the fake server.


This worked, but it was problematic when there was a mix of fake and real backends, because state shared among the real backends was now out of sync with the fake backend.


Our final, successful integration test architecture was to provide fake implementations for all except one of the backends, all sharing the same in-memory state. One real backend was included in the system under test because it was tightly coupled with the Web UI. Its dependencies were all wired to fake backends. These are integration tests over the entire system under test, but they remove the backend dependencies. These tests expand the medium size tests in the test hourglass, allowing us to have fewer end-to-end tests with real backends.
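A minimal sketch of fakes sharing the same in-memory state (the class and field names here are hypothetical, not taken from the original system) might look like:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical state object shared by every fake backend in the test process.
final class FakeBackendState {
  final Map<String, Boolean> tosAccepted = new ConcurrentHashMap<>();
  final Map<String, String> profiles = new ConcurrentHashMap<>();
}

// Each fake receives the same state instance, so the fakes stay consistent with each other.
final class FakeTermsOfServiceBackend {
  private final FakeBackendState state;
  FakeTermsOfServiceBackend(FakeBackendState state) { this.state = state; }

  boolean isAccepted(String userId) {
    return state.tosAccepted.getOrDefault(userId, false);
  }

  void accept(String userId) {
    state.tosAccepted.put(userId, true);
  }
}

final class FakeProfileBackend {
  private final FakeBackendState state;
  FakeProfileBackend(FakeBackendState state) { this.state = state; }

  String displayName(String userId) {
    // This fake can observe state written by the other fake, just as the real backends would.
    String name = state.profiles.getOrDefault(userId, "Unknown user");
    return state.tosAccepted.getOrDefault(userId, false) ? name : name + " (TOS pending)";
  }
}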


Note that these integration tests are not the only option. For logic in the Web UI, we can write page-level unit tests, which allow the tests to run faster and more reliably. For the terms of service feature, however, we want to test the Web UI and server logic together, so integration tests are a good solution.





This resulted in UI tests that ran, unmodified, on both the real and fake backend systems. 


When run with fake backends the tests were faster and more reliable. This made it easier to add test scenarios that would have been more challenging to set up with the real backends. We also deleted end-to-end tests that were fully duplicated by the integration tests, resulting in more integration tests than end-to-end tests.



By iterating, we arrived at a sustainable test architecture for the integration tests.


If you're facing a test hourglass, the test architecture needed to create medium-sized tests may not be obvious. I'd recommend experimenting, dividing the system on well-defined interfaces, and making sure the new tests provide value by running faster and more reliably, or by unlocking hard-to-test areas.




Testing on the Toilet: Testing UI Logic? Follow the User!

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

By Carlos Israel Ortiz García


After years of anticipation, you're finally able to purchase Google's hottest new product, gShoe*. But after clicking the "Buy" button, nothing happened! Inspecting the HTML, you notice the problem:

<button disabled="true" click="$handleBuyClick(data)">Buy</button>

Users couldn’t buy their gShoes because the “Buy” button was disabled. The problem was due to the unit test for handleBuyClick, which passed even though the user interface had a bug:

it('submits purchase request', () => {
  controller = new PurchasePage();
  // Call the method that handles the "Buy" button click
  controller.handleBuyClick(data);
  expect(service).toHaveBeenCalledWith(expectedData);
});

In the above example, the test failed to detect the bug because it bypassed the UI element and instead directly invoked the "Buy" button click handler. To be effective, tests for UI logic should interact with the components on the page as a browser would, which allows testing the behavior that the end user experiences. Writing tests against UI components rather than calling handlers directly faithfully simulates user interactions (e.g., add items to a shopping cart, click a purchase button, or verify an element is visible on the page), making the tests more comprehensive.


The test for the “Buy” button should instead exercise the entire UI component by interacting with the HTML element, which would have caught the disabled button issue:

it('submits purchase request', () => {
  // Renders the page with the "Buy" button and its associated code.
  render(PurchasePage);
  // Tries to click the button, fails the test, and catches the bug!
  buttonWithText('Buy').dispatchEvent(new Event('click'));
  expect(service).toHaveBeenCalledWith(expectedData);
});


Why should tests be written this way? Unlike end-to-end tests, tests for individual UI components don’t require a backend server or the entire app to be rendered. Instead, these tests run in the same self-contained environment and take a similar amount of time to execute as unit tests that just execute the underlying event handlers directly. With this approach, the UI acts as the public API, leaving the business logic as an implementation detail (also known as the "Use the Front Door First" principle), resulting in better coverage of a feature.

Disclaimer: “gShoe” is not a real Google product. Unfortunately you can’t buy a pair even if the bug is fixed!

Testing on the Toilet: Avoid Hardcoding Values for Better Libraries

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

By Adel Saoud


You may have been in a situation where you're using a value that always remains the same, so you define a constant. This can be a good practice as it removes magic values and improves code readability. But be mindful that hardcoding values can make usability and potential refactoring significantly harder.

Consider the following function that relies on hardcoded values:
// Declared in the module.
constexpr int kThumbnailSizes[] = {480, 576, 720};

// Returns thumbnails of various sizes for the given image.
std::vector<Image> GetThumbnails(const Image& image) {
  std::vector<Image> thumbnails;
  for (const int size : kThumbnailSizes) {
    thumbnails.push_back(ResizeImage(image, size));
  }
  return thumbnails;
}


Using hardcoded values can make your code:
  • Less predictable: The caller might not expect the function to be relying on hardcoded values outside its parameters; a user of the function shouldn’t need to read the function’s code to know that. Also, it is difficult to predict the product/resource/performance implications of changing these hardcoded values.
  • Less reusable: The caller is not able to call the function with different values and is stuck with the hardcoded values. If the caller doesn’t need all these sizes or needs a different size, the function has to be forked or refactored to avoid aforementioned complications with existing callers.

When designing a library, prefer to pass required values, such as through a function call or a constructor. The code above can be improved as follows:
std::vector<Image> GetThumbnails(const Image& image, absl::Span<const int> sizes) {
  std::vector<Image> thumbnails;
  for (const int size : sizes) {
    thumbnails.push_back(ResizeImage(image, size));
  }
  return thumbnails;
}


If most of the callers use the same value for a certain parameter, make your code configurable so that this value doesn't need to be duplicated by each caller. For example, you can define a public constant that contains a commonly used value, or use default arguments in languages that support this feature (e.g. C++ or Python).
// Declared in the public header.
inline constexpr int kDefaultThumbnailSizes[] = {480, 576, 720};

// Default argument allows the function to be used without specifying a size.
std::vector<Image> GetThumbnails(const Image& image,
                                 absl::Span<const int> sizes = kDefaultThumbnailSizes);

Code Coverage Best Practices

By Carlos Arguelles, Marko Ivanković, and Adam Bender


We have spent several decades driving software testing initiatives in various very large software companies. One of the areas that we have consistently advocated for is the use of code coverage data to assess risk and identify gaps in testing. However, the value of code coverage is a highly debated subject with strong opinions, and a surprisingly polarizing topic. Every time code coverage is mentioned in any large group of people, seemingly endless arguments ensue. These tend to lead the conversation away from any productive progress, as people securely bunker in their respective camps. The purpose of this document is to give you tools to steer people on all ends of the spectrum to find common ground so that you can move forward and use coverage information pragmatically. We put forth best practices in the domain of code coverage to work effectively with code health.

  • Code coverage provides significant benefits to the developer workflow. It is not a perfect measure of test quality, but it does offer a reasonable, objective, industry standard metric with actionable data. It does not require significant human interaction, it applies universally to all products, and there are ample tools available in the industry for most languages. You must treat it with the understanding that it’s a lossy and indirect metric that compresses a lot of information into a single number, so it should not be your only source of truth. Instead, use it in conjunction with other techniques to create a more holistic assessment of your testing efforts.
  • It is an open research question whether code coverage alone reduces defects, but our experience shows that efforts in increasing code coverage can often lead to culture changes in engineering excellence that in the long run reduce defects. For example, teams that give code coverage priority tend to treat testing as a first class citizen, and tend to bake stronger testability into their product design, so that they can achieve their testing goals with less effort. All this in turn leads to writing higher quality code to begin with (more modular, cleaner contracts in their APIs, more manageable code reviews, etc.). They also start caring more about their overall health, and engineering and operational excellence.
  • A high code coverage percentage does not guarantee high quality in the test coverage. Focusing on getting the number as close as possible to 100% leads to a false sense of security. It could also be wasteful, burning machine cycles and creating technical debt from low-value tests that now need to be maintained. Bad code being pushed to production due to missing tests could happen either because (a) your tests did not cover a specific path of code, a test gap that is easy to identify with code coverage analysis, or (b) because your tests did not cover a specific edge case in an area that did have code coverage, which is difficult or impossible to catch with code coverage analysis. Code coverage does not guarantee that the covered lines or branches have been tested correctly, it just guarantees that they have been executed by a test. Be mindful of copy/pasting tests just for the sake of increasing coverage, or adding tests with little actual value, to comply with the number. A better technique to assess whether you’re adequately exercising the lines your tests cover, and adequately asserting on failures, is mutation testing.
  • But a low code coverage number does guarantee that large areas of the product are going completely untested by automation on every single deployment. This increases our risk of pushing bad code to production, so it should receive attention. In fact a lot of the value of code coverage data is to highlight not what’s covered, but what’s not covered.
  • There is no “ideal code coverage number” that universally applies to all products. The level of testing you want/need for a set of code should be a function of (a) business impact/criticality of the code; (b) how often you will need to touch/change the code; (c) how much longer you expect the code to live, its complexity, and domain variables. We cannot mandate every single team should have x% code coverage; this is a business decision best made by the owners of the product with domain-specific knowledge. Any mandate to reach x% code coverage should be accompanied by infrastructure investments to make testing easy, such as integrating tools into the developer workflow. Be mindful that engineers may start treating your target like a checkbox and avoid increasing coverage beyond the target, even if doing so would be prudent.
  • In general, code coverage of a lot of products is below the bar; we should aim at significantly improving code coverage across the board. Although there is no “ideal code coverage number,” at Google we offer the general guidelines of 60% as “acceptable,” 75% as “commendable,” and 90% as “exemplary.” However, we like to stay away from broad top-down mandates and encourage every team to select the value that makes sense for their business needs.
  • We should not be obsessing on how to get from 90% code coverage to 95%. The gains of increasing code coverage beyond a certain point are logarithmic. But we should be taking concrete steps to get from 30% to 70% and always making sure new code meets our desired threshold.
  • More important than the percentage of lines covered is human judgment over the actual lines of code (and behaviors) that aren’t being covered (analyzing the gaps in testing) and whether this risk is acceptable or not. What’s not covered is more meaningful than what is covered. Pragmatic discussions over specific lines of code not covered that take place during the code review process are more valuable than over-indexing on an arbitrary target number. We have found that embedding code coverage into your code review process makes code reviews faster and easier. Not all code is equally important; for example, testing debug log lines is often not as important, so when developers can see not just the coverage number, but each covered line highlighted as part of the code review, they will make sure that the most important code is covered.
  • Just because your product has low code coverage doesn’t mean you can’t take concrete, incremental steps to improve it over time. Inheriting a legacy system with poor testing and poor testability can be daunting, and you may not feel empowered to turn it around, or even know where to start. But at the very least, you can adopt the ‘boy-scout rule’ (leave the campground cleaner than you found it). Over time, and incrementally, you will get to a healthy location.
  • Make sure that frequently changing code is covered. While project wide goals above 90% are most likely not worth it, per-commit coverage goals of 99% are reasonable, and 90% is a good lower threshold. We need to ensure that our tests are not getting worse over time.
  • Unit test code coverage is only a piece of the puzzle. Integration/System test code coverage is important too. And the aggregate view of the coverage of all sources in your Pipeline (unit and integration) is paramount, as it gives you the bigger picture of how much of your code is not exercised by your test automation as it makes its way in your pipeline to a production environment. One thing you should be aware of is while unit tests have high correlation between executed and evaluated code, some of the coverage from integration tests and end-to-end tests is incidental and not deliberate. But incorporating code coverage from integration tests can help you avoid situations where you have a false sense of security that even though you’re not covering code in your unit tests, you think you’re covering it in your integration tests.
  • We should gate deployments that do not meet our code coverage standards. Teams should debate and decide which gating mechanism makes sense for them. You should, however, be careful that it doesn’t turn into a checkbox that must be filled, as that can backfire (pressure to 'hit the metric' almost never yields the desired outcome). There are many mechanisms available: gating on coverage for all code vs. only new code; gating on a specific hard-coded coverage number vs. the delta from the prior version; choosing specific parts of the code to ignore or focus on. Then commit to upholding these as a team. Drops in code coverage that violate the gate should prevent the code from being checked in and reaching production.

If you would like to learn more about Google's coverage infrastructure, we welcome you to read our paper, “Coverage at Google.”

Testing on the Toilet: Don’t Mock Types You Don’t Own

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

By Stefan Kennedy and Andrew Trenk

The code below mocks a third-party library. What problems can arise when doing this?

// Mock a salary payment library
@Mock SalaryProcessor mockSalaryProcessor;
@Mock TransactionStrategy mockTransactionStrategy;
...
when(mockSalaryProcessor.addStrategy()).thenReturn(mockTransactionStrategy);
when(mockSalaryProcessor.paySalary()).thenReturn(TransactionStrategy.SUCCESS);
MyPaymentService myPaymentService = new MyPaymentService(mockSalaryProcessor);
assertThat(myPaymentService.sendPayment()).isEqualTo(PaymentStatus.SUCCESS);

Mocking types you don’t own can make maintenance more difficult:
  • It can make it harder to upgrade the library to a new version: The expectations of an API hardcoded in a mock can be wrong or get out of date. This may require time-consuming work to manually update your tests when upgrading the library version. In the above example, an update that changes addStrategy() to return a new type derived from TransactionStrategy (e.g. SalaryStrategy) requires the mock to be updated to return this type, even though the code under test doesn’t need to be changed since it can still reference TransactionStrategy.
  • It can make it harder to know whether a library update introduced a bug in your code: The assumptions built into mocks may get out of date as changes are made to the library, resulting in tests that pass even when the code under test has a bug. In the above example, if a library update changes paySalary() to instead return TransactionStrategy.SCHEDULED, a bug could potentially be introduced due to the code under test not handling this return value properly. However, the maintainer wouldn’t know because the mock would not return this value so the test would continue to pass.
Instead of using a mock, use the real implementation, or if that’s not feasible, use a fake implementation that is ideally provided by the library owner. This reduces the maintenance burden since the issues with mocks listed above don’t occur when using a real or fake implementation. For example:
FakeSalaryProcessor fakeProcessor = new FakeSalaryProcessor(); // Designed for tests
MyPaymentService myPaymentService = new MyPaymentService(fakeProcessor);
assertThat(myPaymentService.sendPayment()).isEqualTo(PaymentStatus.SUCCESS);

If you can’t use the real implementation and a fake implementation doesn’t exist (and library owners aren’t able to create one), create a wrapper class that calls the type, and mock this instead. This reduces the maintenance burden by avoiding mocks that rely on the signatures of the library API. For example:


@Mock MySalaryProcessor mockMySalaryProcessor; // Wraps the SalaryProcessor library
...
// Mock the wrapper class rather than the library itself
when(mockMySalaryProcessor.sendSalary()).thenReturn(PaymentStatus.SUCCESS);

MyPaymentService myPaymentService = new MyPaymentService(mockMySalaryProcessor);
assertThat(myPaymentService.sendPayment()).isEqualTo(PaymentStatus.SUCCESS);
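The wrapper class mocked above is not shown in the episode. A minimal sketch of what it might look like (MySalaryProcessor, sendSalary(), and PaymentStatus.FAILURE are illustrative names implied by the example, not a real library API) is:

// Wraps the third-party SalaryProcessor so the rest of the codebase never sees its types.
class MySalaryProcessor {
  private final SalaryProcessor salaryProcessor;

  MySalaryProcessor(SalaryProcessor salaryProcessor) {
    this.salaryProcessor = salaryProcessor;
  }

  // Calls the library and translates its result into the codebase's own PaymentStatus.
  PaymentStatus sendSalary() {
    salaryProcessor.addStrategy();
    return TransactionStrategy.SUCCESS.equals(salaryProcessor.paySalary())
        ? PaymentStatus.SUCCESS
        : PaymentStatus.FAILURE;  // FAILURE is assumed; only SUCCESS appears in the original example
  }
}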

To avoid the problems listed above, prefer to test the wrapper class with calls to the real implementation. The downsides of testing with the real implementation (e.g. tests taking longer to run) are limited only to the tests for this wrapper class rather than tests throughout your codebase.

“Don’t mock types you don’t own” is also described by Steve Freeman and Nat Pryce in their book, Growing Object-Oriented Software, Guided by Tests. For more details about the downsides of overusing mocks (even for types you do own), see this Google Testing Blog post.