Tag Archives: Adam Bender

SMURF: Beyond the Test Pyramid

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

By Adam Bender

The test pyramid is the canonical heuristic for guiding test suite evolution. It conveys a simple message: prefer more unit tests than integration tests, and prefer more integration tests than end-to-end tests.

A diagram of the test pyramid

While useful, the test pyramid lacks the details you need as your test suite grows and you face challenging trade-offs. To scale your test suite, go beyond the test pyramid.

The SMURF mnemonic is an easy way to remember the trade-offs to consider when balancing your test suite:

Speed: Unit tests are faster than other test types and can be run more often, so you’ll catch problems sooner.
Maintainability: The aggregated cost of debugging and maintaining tests (of all types) adds up quickly. A larger system under test has more code, and thus greater exposure to dependency churn and requirement drift which, in turn, creates more maintenance work.
Utilization: Tests that use fewer resources (memory, disk, CPU) cost less to run. A good test suite optimizes resource utilization so that resource use does not grow super-linearly with the number of tests. Unit tests usually have better utilization characteristics, often because they use test doubles or only involve limited parts of a system.
Reliability: Reliable tests only fail when an actual problem has been discovered. Sorting through flaky tests for problems wastes developer time and costs resources in rerunning the tests. As the size of a system and its corresponding tests grow, non-determinism (and thus, flakiness) creeps in, and your test suite is more likely to become unreliable.
Fidelity: High-fidelity tests come closer to approximating real operating conditions (e.g., real databases or traffic loads) and better predict the behavior of our production systems. Integration and end-to-end tests can better reflect realistic conditions, while unit tests have to simulate the environment, which can lead to drift between test expectations and reality.

A radar chart depicting the relationship between SMURF attributes as applied to unit, integration, and end-to-end tests. Unit tests perform best on all attributes except fidelity, where they are the worst. Integration tests are mid-way performers on all aspects. End-to-end tests are worst on all aspects, except fidelity where they are the best.

A radar chart of Test Type vs. Test Property (i.e. SMURF). Farther from center is better.

In many cases, the relationships between the SMURF dimensions are in tension: improving one dimension can affect the others. However, if you can improve one or more dimensions of a test without harming the others, then you should do so. When thinking about the types of your tests (unit, integration, end-to-end), your choices have meaningful implications for your test suite’s cost and the value it provides.
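
As a rough illustration of the tension between speed and fidelity, the sketch below runs the same behavioral check twice: once against an in-memory fake and once against a real SQLite engine. Every name here (`FakeUserStore`, `SqliteUserStore`, `register`, `check_duplicate_rejected`) is hypothetical and chosen only for this example; it is not code from the original article.

```python
import sqlite3


class FakeUserStore:
    """In-memory test double: fast and cheap, but it only simulates a database."""

    def __init__(self):
        self._names = set()

    def add(self, name):
        if name in self._names:
            raise ValueError(f"duplicate user: {name}")
        self._names.add(name)

    def count(self):
        return len(self._names)


class SqliteUserStore:
    """Backed by a real SQL engine: higher fidelity, more setup and resources."""

    def __init__(self, connection):
        self._conn = connection
        self._conn.execute("CREATE TABLE users (name TEXT PRIMARY KEY)")

    def add(self, name):
        try:
            self._conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
        except sqlite3.IntegrityError as exc:
            raise ValueError(f"duplicate user: {name}") from exc

    def count(self):
        return self._conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


def register(store, name):
    """Hypothetical code under test: normalizes a name and stores it."""
    store.add(name.strip().lower())


def check_duplicate_rejected(store):
    """The same behavioral check, reusable at either fidelity level."""
    register(store, "Ada")
    try:
        register(store, " ada ")
    except ValueError:
        return store.count() == 1
    return False


# Unit-style run: the dependency is a fake.
unit_ok = check_duplicate_rejected(FakeUserStore())

# Integration-style run: a real SQLite engine (in-memory, so still ephemeral).
integration_ok = check_duplicate_rejected(
    SqliteUserStore(sqlite3.connect(":memory:"))
)
```

The fake only stays trustworthy as long as someone keeps its duplicate handling in line with the real store, which is the kind of drift the Fidelity item warns about.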

Source: Google Testing Blog

Code Coverage Best Practices

By Carlos Arguelles, Marko Ivanković, and Adam Bender

We have spent several decades driving software testing initiatives in various very large software companies. One of the areas that we have consistently advocated for is the use of code coverage data to assess risk and identify gaps in testing. However, the value of code coverage is a highly debated and surprisingly polarizing subject. Every time code coverage is mentioned in any large group of people, seemingly endless arguments ensue. These tend to lead the conversation away from any productive progress, as people bunker securely in their respective camps. The purpose of this document is to give you tools to steer people on all ends of the spectrum toward common ground so that you can move forward and use coverage information pragmatically. We put forth best practices for using code coverage effectively in support of code health.

Code coverage provides significant benefits to the developer workflow. It is not a perfect measure of test quality, but it does offer a reasonable, objective, industry standard metric with actionable data. It does not require significant human interaction, it applies universally to all products, and there are ample tools available in the industry for most languages. You must treat it with the understanding that it’s a lossy and indirect metric that compresses a lot of information into a single number so it should not be your only source of truth. Instead, use it in conjunction with other techniques to create a more holistic assessment of your testing efforts.
It is an open research question whether code coverage alone reduces defects, but our experience shows that efforts in increasing code coverage can often lead to culture changes in engineering excellence that in the long run reduce defects. For example, teams that give code coverage priority tend to treat testing as a first class citizen, and tend to bake stronger testability into their product design, so that they can achieve their testing goals with less effort. All this in turn leads to writing higher quality code to begin with (more modular, cleaner contracts in their APIs, more manageable code reviews, etc.). They also start caring more about their overall health, and engineering and operational excellence.
A high code coverage percentage does not guarantee high quality in the test coverage. Focusing on getting the number as close as possible to 100% leads to a false sense of security. It could also be wasteful, burning machine cycles and creating technical debt from low-value tests that now need to be maintained. Bad code being pushed to production due to missing tests could happen either because (a) your tests did not cover a specific path of code, a test gap that is easy to identify with code coverage analysis, or because (b) your tests did not cover a specific edge case in an area that did have code coverage, which is difficult or impossible to catch with code coverage analysis. Code coverage does not guarantee that the covered lines or branches have been tested correctly; it only guarantees that they have been executed by a test. Be mindful of copy/pasting tests just for the sake of increasing coverage, or adding tests with little actual value, to comply with the number. A better technique to assess whether you’re adequately exercising the lines your tests cover, and adequately asserting on failures, is mutation testing.
But a low code coverage number does guarantee that large areas of the product are going completely untested by automation on every single deployment. This increases our risk of pushing bad code to production, so it should receive attention. In fact a lot of the value of code coverage data is to highlight not what’s covered, but what’s not covered.
There is no “ideal code coverage number” that universally applies to all products. The level of testing you want/need for a set of code should be a function of (a) business impact/criticality of the code; (b) how often you will need to touch/change the code; (c) how much longer you expect the code to live, how complex it is, and other domain variables. We cannot mandate that every single team have x% code coverage; this is a business decision best made by the owners of the product with domain-specific knowledge. Any mandate to reach x% code coverage should be accompanied by infrastructure investments to make testing easy, such as integrating tools into the developer workflow. Be mindful that engineers may start treating your target like a checkbox and avoid increasing coverage beyond the target, even if doing so would be prudent.
In general, code coverage for many products is below the bar; we should aim at significantly improving code coverage across the board. Although there is no “ideal code coverage number,” at Google we offer the general guidelines of 60% as “acceptable”, 75% as “commendable” and 90% as “exemplary.” However, we like to stay away from broad top-down mandates and encourage every team to select the value that makes sense for their business needs.
We should not be obsessing on how to get from 90% code coverage to 95%. The gains of increasing code coverage beyond a certain point are logarithmic. But we should be taking concrete steps to get from 30% to 70% and always making sure new code meets our desired threshold.
More important than the percentage of lines covered is human judgment over the actual lines of code (and behaviors) that aren’t being covered (analyzing the gaps in testing) and whether this risk is acceptable or not. What’s not covered is more meaningful than what is covered. Pragmatic discussions over specific lines of code not covered that take place during the code review process are more valuable than over-indexing on an arbitrary target number. We have found that embedding code coverage into your code review process makes code reviews faster and easier. Not all code is equally important; for example, testing debug log lines is often less important. When developers can see not just the coverage number but each covered line highlighted as part of the code review, they can make sure that the most important code is covered.
Just because your product has low code coverage doesn’t mean you can’t take concrete, incremental steps to improve it over time. Inheriting a legacy system with poor testing and poor testability can be daunting, and you may not feel empowered to turn it around, or even know where to start. But at the very least, you can adopt the ‘boy-scout rule’ (leave the campground cleaner than you found it). Over time, and incrementally, you will get to a healthy location.
Make sure that frequently changing code is covered. While project wide goals above 90% are most likely not worth it, per-commit coverage goals of 99% are reasonable, and 90% is a good lower threshold. We need to ensure that our tests are not getting worse over time.
Unit test code coverage is only a piece of the puzzle. Integration/System test code coverage is important too. And the aggregate view of the coverage of all sources in your pipeline (unit and integration) is paramount, as it gives you the bigger picture of how much of your code is not exercised by your test automation as it makes its way through your pipeline to a production environment. Be aware that while unit tests have a high correlation between executed and evaluated code, some of the coverage from integration tests and end-to-end tests is incidental rather than deliberate. Still, incorporating code coverage from integration tests helps you avoid a false sense of security, where code that is not covered by your unit tests is assumed, without evidence, to be covered by your integration tests.
We should gate deployments that do not meet our code coverage standards. Teams should debate and decide which gating mechanism makes sense to them. However, be careful that the gate is not treated as a checkbox to be ticked, as that can backfire (pressure to 'hit the metric' almost never yields the desired outcome). There are many mechanisms available: gate on coverage for all code vs. coverage for new code only; gate on a specific hard-coded code coverage number vs. the delta from the prior version; and choose specific parts of the code to ignore or focus on. Whichever mechanism you choose, commit to upholding it as a team. Drops in code coverage violating the gate should prevent the code from being checked in and reaching production.
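
The gating choices in the last item (a hard-coded number vs. a delta from the prior version, all code vs. new code only) can be sketched as two small predicates. This is a minimal illustration under assumed inputs: `CoverageReport`, both gate functions, and the line counts are hypothetical and do not describe any particular coverage tool; the 60% and 90% thresholds are borrowed from the guidelines quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoverageReport:
    """Hypothetical summary of a coverage run over some set of lines."""

    covered_lines: int
    total_lines: int

    @property
    def percent(self):
        # Treat an empty set of lines (e.g. a docs-only change) as fully covered.
        if self.total_lines == 0:
            return 100.0
        return 100.0 * self.covered_lines / self.total_lines


def passes_absolute_gate(report, minimum_percent):
    """Gate on a specific hard-coded coverage number."""
    return report.percent >= minimum_percent


def passes_delta_gate(previous, current, max_drop_percent=0.0):
    """Gate on the change in coverage relative to the prior version."""
    return current.percent >= previous.percent - max_drop_percent


# Made-up numbers for illustration only.
previous_version = CoverageReport(covered_lines=700, total_lines=1000)
current_version = CoverageReport(covered_lines=690, total_lines=1000)
new_code_only = CoverageReport(covered_lines=45, total_lines=50)

absolute_all_code = passes_absolute_gate(current_version, minimum_percent=60.0)
delta_all_code = passes_delta_gate(previous_version, current_version)
absolute_new_code = passes_absolute_gate(new_code_only, minimum_percent=90.0)
```

With these numbers the same change clears the absolute gate on all code and the new-code gate but fails the delta gate, which is why the choice of mechanism is worth a team discussion rather than a default.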

If you would like to learn more about Google's coverage infrastructure, we welcome you to read our paper “Coverage at Google” which can be found here.
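
To make concrete the earlier point that executed lines are not necessarily tested lines, here is a small hand-rolled sketch of the idea behind mutation testing. Real mutation testing tools generate mutants automatically; here a single mutant is written by hand, and all function names are hypothetical.

```python
def shipping_cost(weight_kg):
    """Hypothetical function under test."""
    if weight_kg > 20:
        return 15
    return 5


def shipping_cost_mutant(weight_kg):
    """Hand-written mutant: the `>` comparison is changed to `>=`."""
    if weight_kg >= 20:
        return 15
    return 5


def weak_test(cost_fn):
    """Executes every line of the function but asserts nothing about results."""
    cost_fn(30)
    cost_fn(1)
    return True


def strong_test(cost_fn):
    """Checks the results, including the boundary value."""
    return cost_fn(30) == 15 and cost_fn(20) == 5 and cost_fn(1) == 5


# Both tests pass on the original and both give full line coverage.
weak_passes_original = weak_test(shipping_cost)
strong_passes_original = strong_test(shipping_cost)

# A test "kills" a mutant when it fails on the mutated code.
weak_kills_mutant = not weak_test(shipping_cost_mutant)
strong_kills_mutant = not strong_test(shipping_cost_mutant)
```

Both tests report the same line coverage, yet only the second one notices the changed comparison. That difference is invisible to a coverage percentage.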

Source: Google Testing Blog

Testing on the Toilet: What Makes a Good End-to-End Test?

By Adam Bender

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

An end-to-end test tests your entire system from one end to the other, treating everything in between as a black box. End-to-end tests can catch bugs that manifest across your entire system. In addition to unit and integration tests, they are a critical part of a balanced testing diet, providing confidence about the health of your system in a near production state. Unfortunately, end-to-end tests are slower, more flaky, and more expensive to maintain than unit or integration tests. Consider carefully whether an end-to-end test is warranted, and if so, how best to write one.

Let's consider how an end-to-end test might work for the following "login flow":

In order to be cost effective, an end-to-end test should focus on aspects of your system that cannot be reliably evaluated with smaller tests, such as resource allocation, concurrency issues and API compatibility. More specifically:

For each important use case, there should be one corresponding end-to-end test. This should include one test for each important class of error. The goal is to keep your total end-to-end test count low.
Be prepared to allocate at least one week a quarter per test to keep your end-to-end tests stable in the face of issues like slow and flaky dependencies or minor UI changes.
Focus your efforts on verifying overall system behavior instead of specific implementation details; for example, when testing login behavior, verify that the process succeeds independent of the exact messages or visual layouts, which may change frequently.
Make your end-to-end test easy to debug by providing an overview-level log file, documenting common test failure modes, and preserving all relevant system state information (e.g., screenshots and database snapshots).
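
The advice above about verifying overall behavior rather than implementation details can be sketched as two ways of checking a login result. The `login` function below is a hypothetical stand-in for driving a real system through its login flow, and the response fields are assumptions made for this example.

```python
def login(username, password):
    """Hypothetical stand-in for exercising the real login flow end to end."""
    if username == "ada" and password == "correct-horse":
        return {
            "status": 200,
            "session_token": "token-123",
            "banner": "Welcome back, Ada!",
        }
    return {
        "status": 401,
        "session_token": None,
        "banner": "Sorry, that didn't work.",
    }


def brittle_check(response):
    # Couples the test to exact wording, which may change frequently.
    return response["banner"] == "Welcome back, Ada!"


def behavioral_check(response):
    # Verifies the outcome: the user ends up with a usable session.
    return response["status"] == 200 and bool(response["session_token"])


good_login = login("ada", "correct-horse")
bad_login = login("ada", "wrong-password")

login_succeeds = behavioral_check(good_login)
bad_password_rejected = not behavioral_check(bad_login)
```

A copy change to the banner would break `brittle_check` without any real regression, while `behavioral_check` keeps passing as long as login actually works.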

End-to-end tests also come with some important caveats:

System components that are owned by other teams may change unexpectedly, and break your tests. This increases overall maintenance cost, but can highlight incompatible changes.
It may be more difficult to make an end-to-end test fully hermetic; leftover test data may alter future tests and/or production systems. Where possible, keep your test data ephemeral.
An end-to-end test often necessitates multiple test doubles (fakes or stubs) for underlying dependencies; they can, however, have a high maintenance burden as they drift from the real implementations over time.
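
One way to keep test data ephemeral, as suggested in the caveats above, is to create it inside a scratch location that is removed when the test ends. The sketch below uses Python's standard `tempfile` module; the helper names and the file contents are hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def ephemeral_workspace():
    """Yield a scratch directory that is deleted when the test finishes."""
    with tempfile.TemporaryDirectory(prefix="e2e-") as path:
        yield path


def run_test_with_ephemeral_data():
    with ephemeral_workspace() as workspace:
        data_file = os.path.join(workspace, "users.txt")
        with open(data_file, "w", encoding="utf-8") as handle:
            handle.write("test-user\n")
        existed_during_test = os.path.exists(data_file)
    # The directory and everything in it is removed once the block exits,
    # even if the test body raised an exception.
    return existed_during_test, os.path.exists(workspace)


existed_during_test, exists_after_test = run_test_with_ephemeral_data()
```

The same pattern applies to other kinds of state (database rows, queues, accounts): create them in a scope whose exit always cleans up, so nothing leaks into later tests.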
