
If it ain’t broke, you’re not trying hard enough

How Much Testing is Enough?

By George Pirocanac



A familiar question every software developer and team grapples with is, “How much testing is enough to qualify a software release?” A lot depends on the type of software, its purpose, and its target audience. One would expect a far more rigorous approach to testing a commercial search engine than a simple smartphone flashlight application. Yet no matter what the application, the question of how much testing is sufficient can be hard to answer in definitive terms. A better approach is to provide considerations or rules of thumb that can be used to define a qualification process and testing strategy best suited for the case at hand. The following tips provide a helpful rubric:


  • Document your process or strategy.
  • Have a solid base of unit tests.
  • Don’t skimp on integration testing.
  • Perform end-to-end testing for Critical User Journeys.
  • Understand and implement the other tiers of testing.
  • Understand your coverage of code and functionality.
  • Use feedback from the field to improve your process.

Document your process or strategy


If you are already testing your product, document the entire process. This is essential for being able to both repeat the test for a later release and to analyze it for further improvement. If this is your first release, it’s a good idea to have a written test plan or strategy. In fact, having a written test plan or strategy is something that should accompany any product design.


Have a solid base of unit tests



A great place to start is writing unit tests that accompany the code. Unit tests test the code as it is written at the functional unit level. Dependencies on external services are either mocked or faked. 

A mock has the same interface as the production dependency, but only checks that the object is used according to set expectations and/or returns test-controlled values, rather than having a full implementation of its normal functionality.

A fake, on the other hand, is a shallow implementation of the dependency but should ideally have no dependencies of its own. Fakes provide a wider range of functionality than mocks and should be maintained by the team providing the production version of the dependency. That way, as the dependency evolves, so does the fake, and the unit-test writer can be confident that the fake mirrors the functionality of the production dependency.
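To make the distinction concrete, here is a minimal Python sketch. The names `FakeKeyValueStore` and `remember_visit` are hypothetical and exist only for illustration; the mock comes from the standard library's `unittest.mock`.

```python
from unittest import mock


class FakeKeyValueStore:
    """A fake: a shallow in-memory implementation with no dependencies of its own."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def remember_visit(store, user):
    """Code under test: increments and returns the visit count for a user."""
    count = (store.get(user) or 0) + 1
    store.put(user, count)
    return count


# With a mock: return values are test-controlled and usage is checked
# against expectations; no real storage behavior is implemented.
mock_store = mock.Mock()
mock_store.get.return_value = 2
mock_result = remember_visit(mock_store, "ada")
mock_store.put.assert_called_once_with("ada", 3)

# With a fake: state really is stored, so repeated calls behave the way
# the production dependency would.
fake_store = FakeKeyValueStore()
fake_results = [remember_visit(fake_store, "ada") for _ in range(3)]
```

The mock can only confirm that `put` was called as expected, while the fake lets the test observe the count growing across calls.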

At many companies, including Google, it is a best practice to require that any code change have corresponding unit test cases that pass. As the code base expands, having a body of such tests that is executed before code is submitted is an important part of catching bugs before they creep into the code base. This saves time later in writing integration tests, debugging, and verifying fixes to existing code.


Don’t skimp on integration testing



As the codebase grows and reaches a point where numbers of functional units are available to test as a group, it’s time to have a solid base of integration tests. An integration test takes a small group of units, often only two units, and tests their behavior as a whole, verifying that they coherently work together.

Often developers think that integration tests can be deprioritized or even skipped in favor of full end-to-end tests. After all, the latter really tests the product as the user would exercise it. Yet, having a comprehensive set of integration tests is just as important as having a solid unit-test base (see the earlier Google Blog article, Fixing a test hourglass).

The reason lies in the fact that integration tests have fewer dependencies than full end-to-end tests. As a result, integration tests, with smaller environments to bring up, will be faster and more reliable than the full end-to-end tests with their full set of dependencies (see the earlier Google Blog article, Test Flakiness - One of the Main Challenges of Automated Testing).
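As a small illustration of what "two units tested together" can look like, here is a Python sketch. The functions `parse_order` and `order_total_cents` are hypothetical; the point is only that the test wires the real units together without mocks or a full environment.

```python
def parse_order(line):
    """Unit 1: parses 'sku,quantity,unit_price_cents' into a dict."""
    sku, quantity, price = line.split(",")
    return {"sku": sku.strip(), "quantity": int(quantity), "price_cents": int(price)}


def order_total_cents(order):
    """Unit 2: computes the total price of a parsed order."""
    return order["quantity"] * order["price_cents"]


def test_parse_and_total_work_together():
    """Integration test: the real parser feeds the real calculator."""
    order = parse_order("widget, 3, 250")
    assert order_total_cents(order) == 750
```

A test like this would catch a mismatch between the two units, such as a renamed dictionary key, that separate unit tests with hand-built inputs could miss.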


Perform end-to-end testing for Critical User Journeys



The discussion thus far covers testing the product at its component level, first as individual components (unit-testing), then as groups of components and dependencies (integration testing). Now it’s time to test the product end to end as a user would use it. This is quite important because it’s not just independent features that should be tested but entire workflows incorporating a variety of features. At Google these workflows - the combination of a critical goal and the journey of tasks a user undertakes to achieve that goal - are called Critical User Journeys (CUJs). Understanding CUJs, documenting them, and then verifying them using end-to-end testing (hopefully in an automated fashion) completes the Testing Pyramid.


Understand and implement the other tiers of testing



Unit, integration, and end-to-end testing address the functional level of your product. It is important to understand the other tiers of testing, including:

  • Performance testing - Measuring the latency or throughput of your application or service.
  • Load and scalability testing - Testing your application or service under higher and higher load.
  • Fault-tolerance testing - Testing your application’s behavior as different dependencies either fail or go down entirely.
  • Security testing - Testing for known vulnerabilities in your service or application.
  • Accessibility testing - Making sure the product is accessible and usable for everyone, including people with a wide range of disabilities.
  • Localization testing - Making sure the product can be used in a particular language or region.
  • Globalization testing - Making sure the product can be used by people all over the world.
  • Privacy testing - Assessing and mitigating privacy risks in the product.
  • Usability testing - Testing for user friendliness.

Again, it is important to have these testing processes occur as early as possible in your review cycle. Smaller performance tests can detect regressions earlier and save debugging time during the end-to-end tests.


Understand your coverage of code and functionality



So far, the question of how much testing is enough, from a qualitative perspective, has been examined. Different types of tests were reviewed and the argument made that smaller and earlier is better than larger or later. Now the problem will be examined from a quantitative perspective, taking code coverage techniques into account.

Wikipedia has a great article on code coverage that outlines and discusses different types of coverage, including statement, edge, branch, and condition coverage. There are several open source tools available for measuring coverage for most of the popular programming languages such as Java, C++, Go and Python. A partial list is included in the table below:



Language    Tool
Java        JaCoCo
Java        JCov
Java        OpenClover
Python      Coverage.py
C++         Bullseye
Go          Built-in coverage support (go test -cover)

Table 1 - Open source coverage tools for different languages


Most of these tools provide a summary in percentage terms. For example, 80% code coverage means about 80% of the code is covered and about 20% of the code is uncovered. At the same time, it is important to understand that just because you have coverage for a particular area of code, this code can still have bugs.
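A short Python sketch shows how fully covered code can still hide a bug. The `average` function is a made-up example, not taken from any particular code base.

```python
def average(values):
    """Returns the arithmetic mean; has a latent bug for empty input."""
    return sum(values) / len(values)


def test_average():
    # This single test executes every line of average(), so a line-coverage
    # tool would report the function as fully covered.
    assert average([2, 4, 6]) == 4


test_average()

# The fully "covered" function still fails on an input the test never tried.
try:
    average([])
    empty_input_fails = False
except ZeroDivisionError:
    empty_input_fails = True
```

Coverage tells you which code the tests executed, not which inputs or behaviors they verified.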


Another concept in coverage is called changelist coverage. Changelist coverage measures the coverage in changed or added lines. It is useful for teams that have accumulated technical debt and have low coverage in their entire codebase. These teams can institute a policy where an increase in their incremental coverage will lead to overall improvement.
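One way to picture changelist coverage is as the share of changed lines that tests executed. The sketch below assumes you already have, per file, the set of changed line numbers and the set of covered line numbers; the function name and data shapes are illustrative assumptions, not the output format of any specific tool.

```python
def changelist_coverage(changed_lines, covered_lines):
    """Returns the fraction of changed or added lines that tests executed.

    Both arguments map a file path to a set of line numbers. Files that
    appear only in covered_lines are ignored, because they were not changed.
    An empty change is treated as fully covered (an assumption of this sketch).
    """
    total = 0
    covered = 0
    for path, lines in changed_lines.items():
        total += len(lines)
        covered += len(lines & covered_lines.get(path, set()))
    return covered / total if total else 1.0


changed = {"cart.py": {10, 11, 12, 13}, "checkout.py": {5}}
covered = {"cart.py": {1, 2, 10, 11, 12}, "legacy.py": {7}}
ratio = changelist_coverage(changed, covered)  # 3 of 5 changed lines
```

Here three of the five changed lines are covered, so the changelist coverage is 60% regardless of how poorly the rest of the code base is covered.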


So far the coverage discussion has centered around coverage of the code by tests (functions, lines, etc.). Another type of coverage is feature coverage or behavior coverage. For feature coverage, the emphasis is on identifying the committed features in a particular release and creating tests for their implementation. For behavior coverage, the emphasis is on identifying the CUJs and creating the appropriate tests to track them. Again, understanding your “uncovered” features and behaviors can be a useful metric in your understanding of the risks.



Use feedback from the field to improve your process



A very important part of understanding and improving your qualification process is the feedback received from the field once the software has been released. Having a process that tracks outages and bugs and other issues, in the form of action items to improve qualification, is critical for minimizing the risks of regressions in subsequent releases. Moreover, the action items should be such that they (1) emphasize filling the testing gap as early as possible in the qualification process and (2) address strategic issues such as the lack of testing of a particular type such as load or fault tolerance testing. And again, this is why it is important to document your qualification process so that you can reevaluate it in light of the data you obtain from the field.


Summary



Creating a comprehensive qualification process and testing strategy to answer the question “How much testing is enough?” can be a complex task. Hopefully the tips given here can help you with this. In summary:

  • Document your process or strategy.
  • Have a solid base of unit tests.
  • Don’t skimp on integration testing.
  • Perform end-to-end testing for Critical User Journeys.
  • Understand and implement the other tiers of testing.
  • Understand your coverage of code and functionality.
  • Use feedback from the field to improve your process.



Mutation Testing

By Goran Petrovic

History


It’s been a long-standing tradition of my team to organize hackathons twice a year. In weeks prior to the hackathon, the team gathers and brainstorms ideas for projects, ranging from improving the testing infrastructure or an existing process, to trying out a wild idea they’ve had for some time. Just before the hackathon, the team rates the accumulated ideas on a coolness-impact scale: how much fun does a project sound vs. how impactful could it potentially be; while impact is important, for hackathons, fun is non-negotiable. Then, engineers who are excited to work on some of the proposed projects subscribe and form teams. It was no different in the cold winter of 2013, where among the plethora of cool and wild ideas, one was to prototype Mutation testing.


For those who are not familiar with it, mutation testing is a method of evaluating test quality by injecting bugs into the code and seeing whether the tests detect the fault or not. The more injected bugs the tests catch, the better they are. Here’s an example:


Negating the if condition.

Original:

    def checkout(cart):
        if not cart.items:
            raise Error("cart empty")
        return checkout_internal(cart)

Mutant:

    def checkout(cart):
        if cart.items:
            raise Error("cart empty")
        return checkout_internal(cart)

If a test fails, we say it kills the mutant, and if no tests fail, we say that the mutant is alive.
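The following Python sketch shows a mutant that survives a weak test suite and is killed once a boundary test is added. It uses a different mutation than the example above (replacing `>=` with `>`), and the `free_shipping` function and test helpers are hypothetical.

```python
def free_shipping(total_cents):
    """Original: orders of 5000 cents or more ship for free."""
    return total_cents >= 5000


def free_shipping_mutant(total_cents):
    """Mutant: >= replaced with >."""
    return total_cents > 5000


def weak_test(impl):
    """Checks only values far from the boundary."""
    return impl(10000) is True and impl(100) is False


def boundary_test(impl):
    """Checks the boundary value itself."""
    return impl(5000) is True


def mutant_status(tests, mutant):
    """A mutant is killed if at least one test fails against it."""
    return "killed" if any(not test(mutant) for test in tests) else "alive"


status_with_weak_suite = mutant_status([weak_test], free_shipping_mutant)
status_with_full_suite = mutant_status(
    [weak_test, boundary_test], free_shipping_mutant
)
```

With only `weak_test` the mutant is alive, which points at the missing boundary case; adding `boundary_test` kills it.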


By the end of the hackathon, mutagenesis was implemented for C++ and Python, and a prototype was born: a shell script that evaluates generated mutants in a diff (pull request) and textually reports live mutants in the files in the diff. A year passed with no work done on the project, before I started to work on it in my 20% time. I had no idea what Mutation testing was at the time, so I researched and read papers on the topic, and collected lots of ideas on what I should focus on.


From Prototype To Launch


I quickly realized that the hackathon crew did not calculate the Mutation score, the ratio of mutants detected by tests, which is a prominent metric in the research literature and the holy grail of evaluating test quality, but just enumerated live mutants. My first exposure to mutants was just running the tool on the mutagenesis code itself and trying to understand the report. I was immediately overwhelmed: after a long execution time, I was facing thousands of mutants in just a handful of files. I tried going through a couple, but after a few minutes I grew tired and moved on with my main project, which happened to be on Google Shopping. In the following months, I stayed away from my 20% project, but I kept thinking about it, bugging my colleagues and friends about the ideas I had to make mutation testing work for us. After many months of brainstorming and discussions, almost a year after the original hackathon project, I was ready to design the Mutation Testing Service.


I faced two big problems. First, I could force myself to go through lots of mutants, and maybe find a useful one that would prompt me to write a test case, but I could not force others, not even my teammates. Second, the vast majority of mutants were simply bad. Here are some examples:


Replacing division with subtraction, but in a logging statement.

Original:

    log.Infof("Found %d (%.2f %%)!", e, float64(e)*100.0 / total)

Mutant:

    log.Infof("Found %d (%.2f %%)!", e, float64(e)*100.0 - total)


Appending number 1 to an error message.

Original:

    Error.create(((key + " disabled")));

Mutant:

    Error.create(((key + " disabled") + 1));


Replacing greater than with less than when comparing length of a collection to zero.

Original:

    showCart := len(cart.GetItems()) > 0

Mutant:

    showCart := len(cart.GetItems()) < 0


Replacing the idiomatic python check for whether the module is imported or executed as main.

Original:

    if (__name__ == '__main__'):

Mutant:

    if (__name__ != '__main__'):


Changing python’s string concatenation (+) to string multiplication (*).

Original:

    message = ('id ' + run_idx)

Mutant:

    message = ('id ' * run_idx)


Changing a tuning parameter.

Original:

    slo = (20 * time.Second)

Mutant:

    slo = (20 * time.Second) + 1


Changing a network timeout, but the network layer is mocked in the tests.

Original:

    _TIMEOUT = (60 * 10)

Mutant:

    _TIMEOUT = (60 / 10)


Subtracting from -∞.

Original:

    df = df.replace(
        [numpy.inf, -numpy.inf],
        numpy.nan
    )

Mutant:

    df = df.replace(
        [numpy.inf, -numpy.inf - 1],
        numpy.nan
    )


Yes, the tests did not detect these mutants, but we would not want such tests anyway. Many of them would produce fragile, change-detector tests. We later settled on calling them unproductive mutants: writing tests for those mutants would make the test suite worse, not better.


I realized that I needed to suppress these types of mutants: if I reported them, nobody would use mutation testing, myself included. Most of the mutants were not useful, and that is a waste of developer attention. The onus was on me to create a better tool. I set out to try various heuristics by looking at the report and suppressing mutants that I found unproductive. I encoded the heuristics in AST (Abstract Syntax Tree) matching rules, and I dubbed the AST nodes which contained unproductive mutants as arid nodes. In the beginning, there were only a few rules in the system, but that was enough to make me feel confident that my colleagues would give it a try.


The other big issue was the sheer number of mutants. With five or more in a line, hundreds in a file, it was a challenge to display them, and even if I managed that, nobody would go through them anyway. I quickly realized that they shouldn’t: it took a lot of time for me to go through the mutants, and, while some pointed me to a hole in my test suite, most were useless, and many of them, especially ones in the same line, redundant. I did not need every possible combination of operators changed to tell me that my test for that condition was insufficient; one was just fine. That was my first decision on mutation testing: to report at most one mutant in a line. This was a quick and easy decision to make, because, if you’ve ever used a Code review system, you know that having more makes the review noisy and harder to do. Another reason why it was such an easy decision was that it would have been computationally prohibitively expensive to calculate all mutants, and I could have thrown my 20% project down the drain. I call it limitation-driven development :)
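The one-mutant-per-line decision amounts to a simple filter. Here is a minimal Python sketch, assuming each mutant is a dictionary with hypothetical "file", "line", and "description" keys; it keeps the first mutant seen per line, which is an arbitrary choice made for illustration.

```python
def one_mutant_per_line(mutants):
    """Keeps only the first mutant generated for each (file, line) pair."""
    seen = set()
    kept = []
    for mutant in mutants:
        key = (mutant["file"], mutant["line"])
        if key not in seen:
            seen.add(key)
            kept.append(mutant)
    return kept


candidates = [
    {"file": "cart.go", "line": 12, "description": "replace > with <"},
    {"file": "cart.go", "line": 12, "description": "replace > with >="},
    {"file": "cart.go", "line": 30, "description": "remove statement"},
]
reported = one_mutant_per_line(candidates)
```

In practice the saving comes from not generating or evaluating the redundant mutants at all, rather than from filtering them afterwards.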


Of course, the idea was to report live mutants during Code review. Code review is the perfect time in the engineering process to surface useful findings about the code being changed, and integrating into the existing developer process has the highest chance that the developers will take action. This seemed like the most normal thing in the world: we had hundreds of analyzers and all engineers were used to receiving findings from various analyses of their code. It took an outsider to point out that this was a strange approach: mutation testing was classically run on the whole program and the mutation score calculated and used as guidance.


This is what a Mutant finding looks like in the Code review tool:




Mutation Testing at Google is a dynamic analyzer of code changes that surfaces mutants during Code review by posting code findings. In terms of infrastructure, it consists of three main parts: the change listener, the analyzer, and many mutagenesis servers, one for each language.





Each event during the Code review is announced using a publisher-subscriber pattern, and any interested party can listen, and react, to these events. When a change is sent for Code review, many things happen: linters are run, automated tests are evaluated, coverage is calculated, and for the users of mutation testing, mutants are generated and evaluated. Listening to all events coming from the Code review system, the listener schedules a mutation run on the analyzer.


The first thing the analyzer does is get the code coverage results for the patch in question; from them, the analyzer can extrapolate which tests cover which lines of source code. This is a very useful piece of information, because running the minimum set of tests that can kill a mutant is crucial; if we just ran all tests that were linked in, or covered the project, that would be prohibitively computationally expensive.


Next, for each covered line in each file in the patch, a mutagenesis server for the language in question is asked to produce a mutant. The mutagenesis server parses the file, traverses its AST, and applies the mutators in the requested order (as per mutation context), ignoring arid nodes, nodes in uncovered lines and in lines that are not affected by the proposed patch.


When the analyzer assembles all mutants, it patches them one by one to a version control context and then evaluates all the tests for each mutant in parallel. For mutants for which all tests pass, the analyzer surfaces a finding for the code author and reviewers, and is done for the time being.
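The evaluation step described above can be sketched in a few lines of Python. This is a simplified, sequential illustration rather than the actual service, which evaluates tests in parallel; the data shapes and the `run_test` callback are assumptions made for the example.

```python
def surviving_mutants(mutants, line_coverage, run_test):
    """Returns mutants for which every covering test still passes.

    mutants: list of dicts with "file", "line" and "description" keys.
    line_coverage: maps a test name to {file: set of covered lines}.
    run_test: callable(test_name, mutant) -> True if the test passes.
    """
    survivors = []
    for mutant in mutants:
        covering = [
            test for test, covered in sorted(line_coverage.items())
            if mutant["line"] in covered.get(mutant["file"], set())
        ]
        # Only the covering tests are run; all() stops at the first failure.
        if covering and all(run_test(test, mutant) for test in covering):
            survivors.append(mutant)
    return survivors


coverage = {
    "test_total": {"cart.py": {10, 11}},
    "test_empty": {"cart.py": {10}},
    "test_ui": {"ui.py": {3}},
}
mutants = [
    {"file": "cart.py", "line": 10, "description": "negate condition"},
    {"file": "cart.py", "line": 11, "description": "replace + with -"},
]
# Hypothetical outcomes: which (test, line) pairs fail on the mutated code.
failing = {("test_empty", 10)}
executed = []


def run_test(test, mutant):
    executed.append((test, mutant["line"]))
    return (test, mutant["line"]) not in failing


alive = surviving_mutants(mutants, coverage, run_test)
```

In this toy run, `test_ui` is never executed because it covers none of the mutated lines, the mutant on line 10 is killed by `test_empty`, and the mutant on line 11 survives and would be surfaced as a finding.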


Because the Code review is a laborious and dynamic process, with many rounds of comments from reviewers and many automated findings from hundreds of different analyzers, there can be many snapshots as the patch evolves: adoption of recommendations from reviewers or accepting proposed changes from linters yields many code states. Mutation testing first runs after coverage is available, and then it runs for each subsequent snapshot, because developers like to see the fruits of their labor: when they write a test to kill a mutant, they want to see the mutant killed.


I launched Mutation testing for the Shopping Engineering Productivity team in late 2015. Around 15 of my colleagues were subjected to Mutant findings during their Code reviews, and it was a bumpy start. Each finding has two buttons: Please fix and Not useful, as you can see on the Code review screenshot above. A reviewer can instruct the code author to fix some finding (e.g. a ClangTidy finding might point out that an object is being unnecessarily copied and suggest using a reference instead, or a Mutant finding might point out that code is not well tested). The author and all reviewers can give feedback to the author of the finding/analyzer that their finding is not useful. This is a source of valuable information, and I made use of it. For each mutant that was deemed not useful, I’d check it out and see whether I could generalize from it and add a new rule to my arid node heuristics. Slowly, I collected several hundred heuristics, many of them generally applicable, but many also tied to internal stuff, like monitoring frameworks. More and more, I noticed that just marking nodes as arid and suppressing mutants in them was not enough on its own; a more powerful mechanism was required to reduce this noise even further. Take a look at these motivating examples:

Changing the condition of an if statement, but the body is arid (a logging statement).

Original:

    if _, err := c.Del(req); err != nil {
        log.Errorf("cleanup failed: %v", err)
    }

Mutant:

    if _, err := c.Del(req); err == nil {
        log.Errorf("cleanup failed: %v", err)
    }



Similar pattern, but in C++:

if (!(!status.ok())) {
LOG(WARNING) << "Updating dependency graph failed" << status;
}


I settled for a transitive rule: an AST node is arid if I say it’s arid, or if it’s a compound statement and its entire body is arid. This made sense in retrospect, but it took some looking at reported examples of unproductive mutants to crystallize. Because the logging statements are arid, the whole if statement’s body is arid, and hence, the if statement itself is arid, including its condition.
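The transitive rule can be sketched with Python's standard `ast` module. This is an illustration under stated assumptions, not the production heuristics: the only seed rule here is "calls whose name starts with `logging.` or `log.` are arid", and `ast.unparse` requires Python 3.9 or later.

```python
import ast

# Hypothetical seed rule: expression statements that call a logging function.
ARID_CALL_PREFIXES = ("logging.", "log.")


def is_arid(node):
    """A node is arid if a seed rule says so, or if it is a compound
    statement whose whole body is arid."""
    if isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
        return ast.unparse(node.value.func).startswith(ARID_CALL_PREFIXES)
    if isinstance(node, (ast.If, ast.For, ast.While, ast.With)):
        # ast.With has no orelse attribute, hence the getattr default.
        body = node.body + getattr(node, "orelse", [])
        return all(is_arid(child) for child in body)
    return False


source = """
if err is not None:
    logging.error("cleanup failed: %s", err)
if not cart:
    raise ValueError("cart empty")
"""
module = ast.parse(source)
arid_flags = [is_arid(stmt) for stmt in module.body]
```

The first `if` contains only a logging call, so the whole statement, condition included, is treated as arid; the second `if` raises an error, so mutants in it remain eligible for reporting.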


In the summer of 2015, my intern, Małgorzata Salawa, and I got mutagenesis implemented for C++, Go, Python, and Java, and having transitive arid node detection and surfacing at most a single mutant per line and 7 per file, we called it a v1.0 and launched. Mutation testing was always an opt-in service, and in the beginning had a few users (93 code reviews in Q1 of 2016), but over time it ramped up to 2,500 users in February 2017, to tens of thousands today. The early days were crucial to get the users’ feedback and extend the arid node heuristics ever further. In the beginning, the Not Useful rate was around 80%, and this was already with some heuristics and at most a single mutant per line. With time, I got it down to around 15%. I was always aware that getting the rate to 0% was impossible, because of the nature of the mutants: sometimes, the mutant would produce behavior equivalent to the original, and there was no way to avoid that fully.


Changing cached lookup by removing the cache and always recalculating yields functionally equivalent code, undetectable by conventional testing.
Original:

    func (s *Serv) calculate(in int32) int {
        if val, ok := s.cache[in]; ok {
            return val
        }

        val := s.calc(in)
        s.cache[in] = val
        return val
    }

Mutant:

    func (s *Serv) calculate(in int32) int {
        val := s.calc(in)
        s.cache[in] = val
        return val
    }



I was both surprised and happy that I could lower the Not useful rate to under 50%.


Mutation Context


As time went by, I added support for more languages. In early 2017, I implemented support for JavaScript and TypeScript, and later in the year I added support for Dart. In 2018 I added support for ZetaSQL. And finally, in early 2020, I added support for Kotlin as it became more and more popular in the Android world.


I kept track of various stats for all mutants: their survival rates and Please fix/Not useful ratios. 


The worst performing mutator was ABS (Absolute Value Mutator) that would replace an expression with ±abs(expression), for example:

Original:

    absl::Minutes(10) - elapsed;

Mutant:

    absl::Minutes(-abs(10)) - elapsed;


Looking at the examples, I had to agree. Because the feedback was predominantly negative for this mutator, I quickly disabled it completely for all languages.


I soon noticed that the SBR (Statement Block Removal) mutator, which deletes statements or whole blocks, is the most common one, and that made sense: while mutating a logical or arithmetic operator required the existence of such an operator in the code to be mutated, any line of code was eligible for deletion. Mutants generated by code deletion, though, did not have the best reported usefulness, or productivity. In fact, almost all other mutators generated more productive mutants than the SBR, and that got me thinking: not all code is the same; a condition within an if statement that contains a return statement is not the same as a condition in another location.


Out of this, an idea was born: context-based mutator selection. For each line, I would randomly shuffle mutator operators and pick one by one until one generated a mutant in that line. That was not ideal, because I knew that some operators worked better than others in some contexts. Rather than just randomly picking a mutant operator, I decided to pick the one most likely to generate a surviving mutant that is then most likely to be productive when reported, based on historical data. I had millions of mutants to learn from; I just needed to define the distance between pieces of code. Finally deciding to look for AST nodes that were in contexts similar to that of the node being mutated, I looked at the nodes around the node under mutation, and encoded the child-parent relationships of the nearby nodes to capture the AST context. Armed with the distance measure and with the help of my returning intern Małgorzata, it was easy to find the closest AST contexts from historic mutants and to look at their fate and pick the best one. I ordered the mutators by their productivity and tried to generate a mutant in the node, in that order, since it’s quite possible that some of the mutant operators are not applicable on some piece of code.
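The ordering step can be sketched in Python as follows. This is a deliberate simplification: it matches contexts exactly instead of using a distance measure between AST contexts, and the history format, the context key, and the mutator names other than SBR (ROR and LCR are conventional names from the mutation testing literature, used here for illustration) are all assumptions.

```python
def rank_mutators(history, context, mutators):
    """Orders mutators by historical productivity in the given context.

    history maps (context, mutator) to (productive_count, reported_count).
    Mutators with no history in this context get a productivity of 0.
    """
    def productivity(mutator):
        productive, reported = history.get((context, mutator), (0, 0))
        return productive / reported if reported else 0.0

    return sorted(mutators, key=productivity, reverse=True)


def first_applicable_mutant(ranked_mutators, try_mutate):
    """Tries mutators in rank order and returns the first mutant produced."""
    for mutator in ranked_mutators:
        mutant = try_mutate(mutator)
        if mutant is not None:
            return mutant
    return None


context = ("If", "Compare")  # hypothetical encoding of parent and node types
history = {
    (context, "ROR"): (40, 50),  # relational operator replacement
    (context, "LCR"): (30, 40),  # logical connector replacement
    (context, "SBR"): (10, 50),  # statement block removal
}
ranked = rank_mutators(history, context, ["SBR", "ROR", "LCR"])

# Pretend ROR is not applicable to this node, so the next best one is used.
chosen = first_applicable_mutant(
    ranked, lambda m: None if m == "ROR" else f"mutant via {m}"
)
```

Falling through the ranked list matters because the most productive mutator for a context may simply not apply to the node at hand.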


This was quite an improvement. Both mutant survivability and usefulness improved significantly for all mutant operators and programming languages. You can read more about the findings in the upcoming paper.


Fault Coupling



Mutation testing is only valuable if the test cases we write for mutants are valuable. Mutants do not resemble real bugs; they are simpler than bugs found in the wild. Mutation testing relies on the coupling hypothesis: mutants are coupled with real bugs if a test suite that is sensitive enough to detect mutants is also sensitive enough to detect the more complex real bugs. Reporting mutants and writing tests to kill them only makes sense if they are coupled with real bugs.



I instinctively thought that fault coupling was real, otherwise I would not have worked on mutation testing at all, and I’ve seen many cases where mutants pointed to a bug; but still, I wanted to verify this hypothesis, if only for our code base. I designed an experiment: I would generate all mutants in a line for explicit bug-fixing changes, before and after the fix, and I would check whether, had mutation testing been run, it would have surfaced a mutant in the change that introduced the bug, and potentially prevented it (i.e., killed in the change that fixed the bug and added new test cases). I ran this experiment on weekends for over a month, because we did not have the resources to run it during workdays. While I normally generate a single mutant in a line, to test the fault coupling effect, I used the classical mutation testing approach and generated all possible mutants, while still adhering to the arid node suppression. A total of 33 million test suites were executed to test hundreds of thousands of mutants, finally to conclude that, in around 70% of cases, a bug was coupled with a mutant.


While I was at it, I also checked my intuition on whether a single mutant per line was enough, and found that it was overwhelmingly so: in more than 90% of cases, either all mutants were killed in a line or none was. It’s worth keeping in mind that I still applied my arid node suppression heuristics for this experiment. It was great to finally have confirmation of my intuitions.


I also looked into the developer behavior changes after using mutation testing on a project for longer periods of time, and discovered that projects that use mutation testing get more tests over time, as developers get exposed to more and more mutants. Not only do developers write more test cases, but those test cases are more effective in killing mutants: fewer and fewer mutants get reported over time. I noticed this from personal experience too: when writing unit tests, I would see where I cut some corners in the tests, and anticipated the mutant. Now I just add the missing test cases, rather than facing a mutant in my Code review, and I rarely see mutants these days, as I’ve learned to anticipate and preempt them.


You can read more about the findings in our ICSE paper.


Conclusion



It’s been a long road since that hackathon in the winter of 2013. Mutation testing was a lot of fun to work on. It had its challenges, and I had days where I thought that I would throw everything down the drain (I’m looking at you, clang), but I am glad I stuck with it.



The most interesting part of the project was getting Mutation testing to scale to such a large code base, and that required redefining the problem and adapting it to the existing ecosystem that engineers were already used to. Another interesting angle was working with, and learning from, the academic community, in particular Gordon Fraser (University of Passau) and René Just (University of Washington).



I would like to encourage everyone to give one of the many open source mutation testing tools a try on their projects. With some tweaks here and there, it can be a great way to keep your software well tested.



Test Flakiness – One of the main challenges of automated testing (Part II)

By George Pirocanac

This is part two of a series on test flakiness. The first article discussed the four components under which tests are run and the possible reasons for test flakiness. This article will discuss the triage tips and remedies for flakiness for each of these possible reasons.


Components


To review, the four components where flakiness can occur include:
  • The tests themselves
  • The test-running framework
  • The application or system under test (SUT) and the services and libraries that the SUT and testing framework depend upon
  • The OS and hardware and network that the SUT and testing framework depend upon

This was captured and summarized in the following diagram.

The reasons, triage tips, and remedies for flakiness are discussed below, by component.



The tests themselves


The tests themselves can introduce flakiness. This can include test data, test workflows, initial setup of test prerequisites, and initial state of other dependencies.


Reason for Flakiness

Tips for Triaging

Type of Remedy

Improper initialization or cleanup.

Look for compiler warnings about uninitialized variables. Inspect initialization and cleanup code. Check that the environment is set up and torn down correctly. Verify that test data is correct.


Explicitly initialize all variables with proper values before their use.

Properly set up and tear down the testing environment. Consider an initial test that verifies the state of the environment.

Invalid assumptions about the state of test data.

Rerun test(s) independently.

Make tests independent of any state from other tests and previous runs.

Invalid assumptions about the state of the system, such as the system time.

Explicitly check for system dependency assumptions.

Remove or isolate the SUT dependencies on aspects of the environment that you do not control.

Dependencies on execution time, expecting asynchronous events to occur in a specific order, waiting without timeouts, or race conditions between the tests and the application.

Log the times when accesses to the application are made.


As part of debugging, introduce delays in the application to check for differences in test results.

Add synchronization elements to the tests so that they wait for specific application states. Disable unnecessary caching to have a predictable timeline for the application responses.

Note: Do NOT add arbitrary delays as these can become flaky again over time and slow down the test unnecessarily.

Dependencies on the order in which the tests are run. (Similar to the second case above.)

Rerun test(s) independently.

Make tests independent of each other and of  any state from previous runs.


Table 1 - Reasons, triage tips, and remedies for flakiness in the tests themselves
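
The remedy of waiting for a specific application state, rather than sleeping for an arbitrary time, can be sketched roughly as follows. This is a minimal illustration and not code from any particular test framework: the Waiter class and its waitUntil method are hypothetical names invented for this example.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical test helper: polls for a condition instead of sleeping a fixed time. */
final class Waiter {
  private Waiter() {}

  /**
   * Returns as soon as the condition holds. Fails with an AssertionError if the
   * condition still does not hold once the timeout has expired.
   */
  static void waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
    long deadlineNanos = System.nanoTime() + timeout.toNanos();
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadlineNanos >= 0) {
        throw new AssertionError("Condition not met within " + timeout);
      }
      try {
        // Short pause between checks; the total wait is bounded by the timeout.
        Thread.sleep(pollInterval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new AssertionError("Interrupted while waiting", e);
      }
    }
  }
}
```

A test would then call something like Waiter.waitUntil(() -> page.isLoaded(), Duration.ofSeconds(10), Duration.ofMillis(50)), where page.isLoaded() stands in for whatever state check the application offers. The test proceeds as soon as the state is reached and fails with a clear message if it never is.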

The test-running framework


An unreliable test-running framework can introduce flakiness. 


Reason for Flakiness: Failure to allocate enough resources for the SUT, thus preventing it from running.
Tips for Triaging: Check logs to see if SUT came up.
Type of Remedy: Allocate sufficient resources.

Reason for Flakiness: Improper scheduling of the tests so they “collide” and cause each other to fail.
Tips for Triaging: Explicitly run tests independently in different order.
Type of Remedy: Make tests runnable independently of each other.

Reason for Flakiness: Insufficient system resources to satisfy the test requirements. (Similar to the first case but here resources are consumed while running the workflow.)
Tips for Triaging: Check system logs to see if SUT ran out of resources.
Type of Remedy: Fix memory leaks or similar resource “bleeding.” Allocate sufficient resources to run tests.


Table 2 - Reasons, triage tips, and remedies for flakiness in the test-running framework


The application or SUT and the services and libraries that the SUT and testing framework depend upon


Of course, the application itself (or the SUT) could be the source of flakiness. 
An application can also have numerous dependencies on other services, and each of those services can have its own dependencies. In this chain, each of the services can introduce flakiness.


Reason for Flakiness: Race conditions.
Tips for Triaging: Log accesses of shared resources.
Type of Remedy: Add synchronization elements to the tests so that they wait for specific application states. Note: Do NOT add arbitrary delays as these can become flaky again over time.

Reason for Flakiness: Uninitialized variables.
Tips for Triaging: Look for compiler warnings about uninitialized variables.
Type of Remedy: Explicitly initialize all variables with proper values before their use.

Reason for Flakiness: Being slow to respond or being unresponsive to the stimuli from the tests.
Tips for Triaging: Log the times when requests and responses are made.
Type of Remedy: Check and remove any causes for delays.

Reason for Flakiness: Memory leaks.
Tips for Triaging: Look at memory consumption during test runs. Use tools such as Valgrind to detect leaks.
Type of Remedy: Fix the programming error causing the memory leak. This Wikipedia article has an excellent discussion on these types of errors.

Reason for Flakiness: Oversubscription of resources.
Tips for Triaging: Check system logs to see if SUT ran out of resources.
Type of Remedy: Allocate sufficient resources to run tests.

Reason for Flakiness: Changes to the application (or dependent services) out of sync with the corresponding tests.
Tips for Triaging: Examine revision history.
Type of Remedy: Institute a policy requiring code changes to be accompanied by tests.


Table 3 - Reasons, triage tips, and remedies for flakiness in the application or SUT


The OS and hardware that the SUT and testing framework depend upon


Finally, the underlying hardware and operating system can be sources of test flakiness. 


Reason for Flakiness: Networking failures or instability.
Tips for Triaging: Check for hardware errors in system logs.
Type of Remedy: Fix hardware errors or run tests on different hardware.

Reason for Flakiness: Disk errors.
Tips for Triaging: Check for hardware errors in system logs.
Type of Remedy: Fix hardware errors or run tests on different hardware.

Reason for Flakiness: Resources being consumed by other tasks/services not related to the tests being run.
Tips for Triaging: Examine system process activity.
Type of Remedy: Reduce activity of other processes on test system(s).


Table 4 - Reasons, triage tips, and remedies for flakiness in the OS and hardware of the SUT


Conclusion

As can be seen from the wide variety of failures, having low flakiness in automated testing can be quite a challenge. This article has outlined both the components under which tests are run and the types of flakiness that can occur, and thus can serve as a cheat sheet when triaging and fixing flaky tests.


Test Flakiness – One of the main challenges of automated testing

By George Pirocanac


Dealing with test flakiness is a critical skill in testing because automated tests that do not provide a consistent signal will slow down the entire development process. If you haven’t encountered flaky tests, this article is a must-read as it first tries to systematically outline the causes for flaky tests. If you have encountered flaky tests, see how many fall into the areas listed.


A follow-up article will talk about dealing with each of the causes.


Over the years I’ve seen a lot of reasons for flaky tests, but rather than review them one by one, let’s group the sources of flakiness by the components under which tests are run:
  • The tests themselves
  • The test-running framework
  • The application or system under test (SUT) and the services and libraries that the SUT and testing framework depend upon
  • The OS and hardware that the SUT and testing framework depend upon

This is illustrated below. Figure 1 first shows the hardware/software stack that supports an application or system under test. At the lowest level is the hardware. The next level up is the operating system, followed by the libraries that provide an interface to the system. At the highest level is the middleware, the layer that provides application-specific interfaces.



In a distributed system, however, each of the services of the application and the services it depends upon can reside on a different hardware / software stack as can the test running service. This is illustrated in Figure 2 as the full test running environment.




As discussed above, each of these components is a potential area for flakiness.


The tests themselves


The tests themselves can introduce flakiness. Typical causes include:
  • Improper initialization or cleanup.
  • Invalid assumptions about the state of test data.
  • Invalid assumptions about the state of the system. An example can be the system time.
  • Dependencies on the timing of the application.
  • Dependencies on the order in which the tests are run. (Similar to the first case above.)
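
To make the system-time case concrete, here is a minimal sketch of how such an assumption can be taken out of a test. It relies on the standard java.time.Clock class; the Greeter and GreeterTest classes are hypothetical and exist only for this illustration.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.LocalTime;
import java.time.ZoneOffset;

/** Hypothetical class under test: greets differently before and after noon. */
final class Greeter {
  private final Clock clock;

  Greeter(Clock clock) {
    this.clock = clock;
  }

  String greeting() {
    // Reads the time from the injected clock rather than from the system clock.
    int hour = LocalTime.now(clock).getHour();
    return hour < 12 ? "Good morning" : "Good afternoon";
  }
}

/** Hypothetical test: pins the time, so the result does not depend on when it runs. */
final class GreeterTest {
  static void checkMorningGreeting() {
    Clock nineAm = Clock.fixed(Instant.parse("2024-01-01T09:00:00Z"), ZoneOffset.UTC);
    if (!new Greeter(nineAm).greeting().equals("Good morning")) {
      throw new AssertionError("expected a morning greeting at 09:00 UTC");
    }
  }
}
```

Production code would pass Clock.systemDefaultZone() instead. A test that called LocalTime.now() directly would pass or fail depending on the time of day at which it happened to run.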


The test-running framework


An unreliable test-running framework can introduce flakiness. Typical causes include:

  • Failure to allocate enough resources for the system under test, thus preventing it from starting up.
  • Improper scheduling of the tests so they “collide” and cause each other to fail.
  • Insufficient system resources to satisfy the test requirements.
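
One common way tests "collide" when a framework schedules them side by side is by writing to the same file or directory. The sketch below shows one possible way to avoid that, assuming the tests only need scratch space on the local disk. ScratchSpace and its methods are hypothetical names; the file operations are standard java.nio.file calls.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical test helper: gives every test its own scratch directory. */
final class ScratchSpace {
  private ScratchSpace() {}

  /** Creates a uniquely named directory, so concurrently running tests never share files. */
  static Path newDirectory(String testName) {
    try {
      return Files.createTempDirectory(testName + "-");
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Writes test output inside the given scratch directory only. */
  static Path writeReport(Path scratchDir, String contents) {
    try {
      return Files.writeString(scratchDir.resolve("report.txt"), contents);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because each test receives a different directory, two tests that both write report.txt can run in any order, or at the same time, without overwriting each other's output.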

The application or system under test and the services and libraries that the SUT and testing framework depend upon


Of course, the application itself (or the system under test) could be the source of flakiness. An application can also have numerous dependencies on other services, and each of those services can have its own dependencies. In this chain, each of the services can introduce flakiness. Typical causes include:
  • Race conditions.
  • Uninitialized variables.
  • Being slow to respond or being unresponsive to the stimuli from the tests.
  • Memory leaks.
  • Oversubscription of resources.
  • Changes to the application (or dependent services) happening at a different pace than those to the corresponding tests.

Testing environments are called hermetic when they contain everything that is needed to run the tests (i.e. no external dependencies like servers running in production). Hermetic environments, in general, are less likely to be flaky.

The OS and hardware that the SUT and testing framework depend upon



Finally, the underlying hardware and operating system can be the source of test flakiness. Typical causes include:
  • Networking failures or instability.
  • Disk errors.
  • Resources being consumed by other tasks/services not related to the tests being run.

As can be seen from the wide variety of failures, having low flakiness in automated testing can be quite a challenge. This article has outlined both the areas and the types of flakiness that can occur in those areas, so it can serve as a cheat sheet when triaging flaky tests.


In the follow-up to this article we’ll look at ways of addressing these issues.


Testing on the Toilet: Separation of Concerns? That’s a Wrap!

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.


By Stefan Kennedy


The following function decodes a byte array as an image using an API named SpeedyImg. What maintenance problems might arise due to referencing an API owned by a different team?

SpeedyImgImage decodeImage(List<SpeedyImgDecoder> decoders, byte[] data) {
  SpeedyImgOptions options = getDefaultConvertOptions();
  for (SpeedyImgDecoder decoder : decoders) {
    SpeedyImgResult decodeResult = decoder.decode(decoder.formatBytes(data));
    SpeedyImgImage image = decodeResult.getImage(options);
    if (validateGoodImage(image)) { return image; }
  }
  throw new RuntimeException();
}



Details about how to call the API are mixed with domain logic, which can make the code harder to understand. For example, the call to decoder.formatBytes() is required by the API, but how the bytes are formatted isn’t relevant to the domain logic.


Additionally, if this API is used in many places across a codebase, then all usages may need to change if the way the API is used changes. For example, if the return type of this function is changed to the more generic SpeedyImgResult type, usages of SpeedyImgImage would need to be updated.


To avoid these maintenance problems, create wrapper types to hide API details behind an abstraction:

Image decodeImage(List<ImageDecoder> decoders, byte[] data) {
  for (ImageDecoder decoder : decoders) {
    Image decodedImage = decoder.decode(data);
    if (validateGoodImage(decodedImage)) { return decodedImage; }
  }
  throw new RuntimeException();
}


Wrapping an external API follows the Separation of Concerns principle, since the logic for how the API is called is separated from the domain logic. This has many benefits, including:
  • If the way the API is used changes, encapsulating the API in a wrapper limits how far those changes can propagate across your codebase.
  • You can modify the interface or the implementation of types you own, but you can’t for API types.
  • It is easier to switch or add another API, since they can still be represented by the introduced types (e.g. ImageDecoder/Image).
  • Readability can improve as you don’t need to sift through API code to understand core logic.
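
The wrapper types themselves might look something like the sketch below. The article does not show them, so this is only one possible shape: SpeedyImgImageDecoder is a hypothetical adapter name, and the SpeedyImg classes are trivial stand-ins written so that the example compiles on its own.

```java
// --- Stand-ins for the external SpeedyImg API (hypothetical, for illustration only) ---
final class SpeedyImgOptions {}

final class SpeedyImgImage {
  final byte[] pixels;
  SpeedyImgImage(byte[] pixels) { this.pixels = pixels; }
}

final class SpeedyImgResult {
  private final byte[] bytes;
  SpeedyImgResult(byte[] bytes) { this.bytes = bytes; }
  SpeedyImgImage getImage(SpeedyImgOptions options) { return new SpeedyImgImage(bytes); }
}

final class SpeedyImgDecoder {
  byte[] formatBytes(byte[] data) { return data.clone(); }
  SpeedyImgResult decode(byte[] formatted) { return new SpeedyImgResult(formatted); }
}

// --- Wrapper types owned by your own codebase ---

/** Domain-level image; callers never see SpeedyImg types. */
final class Image {
  private final byte[] pixels;
  Image(byte[] pixels) { this.pixels = pixels; }
  int sizeInBytes() { return pixels.length; }
}

/** Domain-level decoder abstraction used by the domain logic. */
interface ImageDecoder {
  Image decode(byte[] data);
}

/** The only class that knows how SpeedyImg has to be called. */
final class SpeedyImgImageDecoder implements ImageDecoder {
  private final SpeedyImgDecoder delegate;
  private final SpeedyImgOptions options;

  SpeedyImgImageDecoder(SpeedyImgDecoder delegate, SpeedyImgOptions options) {
    this.delegate = delegate;
    this.options = options;
  }

  @Override
  public Image decode(byte[] data) {
    // API-specific steps (formatting bytes, passing options) stay inside the wrapper.
    SpeedyImgResult result = delegate.decode(delegate.formatBytes(data));
    return new Image(result.getImage(options).pixels);
  }
}
```

With this arrangement, a change in how SpeedyImg must be called touches only SpeedyImgImageDecoder, and supporting a second image library means adding another ImageDecoder implementation rather than editing the domain logic.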

Not all external APIs need to be wrapped. For example, if an API would take a huge effort to separate or is simple enough that it doesn't pollute the codebase, it may be better not to introduce wrapper types (e.g. library types like List in Java or std::vector in C++). When in doubt, keep in mind that a wrapper should only be added if it will clearly improve the code (see the YAGNI principle).


“Separation of Concerns” in the context of external APIs is also described by Martin Fowler in his blog post, Refactoring code that accesses external services.






Fixing a Test Hourglass

By Alan Myrvold


Automated tests make it safer and faster to create new features, fix bugs, and refactor code. When planning the automated tests, we envision a pyramid with a strong foundation of small unit tests, some well-designed integration tests, and a few large end-to-end tests. As explained in Just Say No to More End-to-End Tests, tests should be fast, reliable, and specific; end-to-end tests, however, are often slow, unreliable, and difficult to debug.


As software projects grow, often the shape of our test distribution becomes undesirable, either top heavy (no unit or medium integration tests), or like an hourglass.


The hourglass test distribution has a large set of unit tests, a large set of end-to-end tests, and few or no medium integration tests.


                      

To transform the hourglass back into a pyramid — so that you can test the integration of components in a reliable, sustainable way — you need to figure out how to architect the system under test and test infrastructure and make system testability improvements and test-code improvements.


I worked on a project with a web UI, a server, and many backends. There were unit tests at all levels with good coverage and a quickly increasing set of end-to-end tests.


The end-to-end tests found issues that the unit tests missed, but they ran slowly, and environmental issues caused spurious failures, including test data corruption. In addition, some functional areas were difficult to test because they covered more than a single unit but required state within the system that was hard to set up.




We eventually found a good test architecture for faster, more reliable integration tests, but with some missteps along the way.

An example UI-level end-to-end test, written in protractor, looked something like this:


describe('Terms of service are handled', () => {
  it('accepts terms of service', async () => {
    const user = getUser('termsNotAccepted');
    await login(user);
    await see(termsOfServiceDialog());
    await click('Accept');
    await logoff();
    await login(user);
    await not.see(termsOfServiceDialog());
  });
});


This test logs on as a user, sees the terms of service dialog that the user needs to accept, accepts it, then logs off and logs back on to ensure the user is not prompted again.


This terms of service test was a challenge to run reliably, because once an agreement was accepted, the backend server had no RPC method to reverse the operation and “un-accept” the TOS. We could create a new user with each test, but that was time consuming and hard to clean up.


The first attempt to make the terms of service feature testable without end-to-end testing was to hook the server RPC method and set the expectations within the test. The hook intercepts the RPC call and provides expected results instead of calling the backend API.




This approach worked. The test interacted with the backend RPC without really calling it, but it cluttered the test with extra logic.


describe('Terms of service are handled', () => {
  it('accepts terms of service', async () => {
    const user = getUser('someUser');
    await hook('TermsOfService.Get()', true);
    await login(user);
    await see(termsOfServiceDialog());
    await click('Accept');
    await logoff();
    await hook('TermsOfService.Get()', false);
    await login(user);
    await not.see(termsOfServiceDialog());
  });
});



The test met the goal of testing the integration of the web UI and server, but it was unreliable. As the system scaled under load, there were several server processes and no guarantee that the UI would access the same server for all RPC calls, so the hook might be set in one server process and the UI accessed in another. 


The hook also wasn't at a natural system boundary, which made it require more maintenance as the system evolved and code was refactored.


The next design of the test architecture was to fake the backend that eventually processes the terms of service call.


The fake implementation can be quite simple:

public class FakeTermsOfService implements TermsOfService.Service {
  // In-memory record of which users have accepted the terms of service.
  private static final Map<String, Boolean> accepted = new ConcurrentHashMap<>();

  @Override
  public TosGetResponse get(TosGetRequest req) {
    // Assumes TosGetResponse can be constructed from the accepted flag.
    return new TosGetResponse(accepted.getOrDefault(req.UserID(), Boolean.FALSE));
  }

  @Override
  public void accept(TosAcceptRequest req) {
    accepted.put(req.UserID(), Boolean.TRUE);
  }
}



And the test is now uncluttered by the expectations:

describe('Terms of service are handled', () => {
  it('accepts terms of service', async () => {
    const user = getUser('termsNotAccepted');
    await login(user);
    await see(termsOfServiceDialog());
    await click('Accept');
    await logoff();
    await login(user);
    await not.see(termsOfServiceDialog());
  });
});


Because the fake stores the accepted state in memory, there is no need to reset the state for the next test iteration; it is enough just to restart the fake server.


This worked but was problematic when there was a mix of fake and real backends, because state shared among the real backends was now out of sync with the fake backend.


Our final, successful integration test architecture was to provide fake implementations for all except one of the backends, all sharing the same in-memory state. One real backend was included in the system under test because it was tightly coupled with the Web UI. Its dependencies were all wired to fake backends. These are integration tests over the entire system under test, but they remove the backend dependencies. These tests expand the medium size tests in the test hourglass, allowing us to have fewer end-to-end tests with real backends.
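
The idea of several fakes sharing the same in-memory state can be sketched as follows. This is an illustrative simplification rather than the project's actual code: FakeBackendState, FakeAccountsBackend, and FakeTermsBackend are hypothetical names, and the real fakes would implement the backends' service interfaces.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory state shared by every fake backend in one test run. */
final class FakeBackendState {
  private final Set<String> knownUsers = ConcurrentHashMap.newKeySet();
  private final Set<String> acceptedTerms = ConcurrentHashMap.newKeySet();

  void addUser(String userId) { knownUsers.add(userId); }
  boolean hasUser(String userId) { return knownUsers.contains(userId); }
  void acceptTerms(String userId) { acceptedTerms.add(userId); }
  boolean hasAcceptedTerms(String userId) { return acceptedTerms.contains(userId); }
}

/** Hypothetical fake of an account backend. */
final class FakeAccountsBackend {
  private final FakeBackendState state;

  FakeAccountsBackend(FakeBackendState state) { this.state = state; }

  void createUser(String userId) { state.addUser(userId); }
}

/** Hypothetical fake of the terms-of-service backend. */
final class FakeTermsBackend {
  private final FakeBackendState state;

  FakeTermsBackend(FakeBackendState state) { this.state = state; }

  /** Accepts the terms for a known user; returns false for an unknown user. */
  boolean accept(String userId) {
    // Users created through FakeAccountsBackend are visible here because
    // both fakes were constructed with the same state object.
    if (!state.hasUser(userId)) {
      return false;
    }
    state.acceptTerms(userId);
    return true;
  }

  boolean isAccepted(String userId) { return state.hasAcceptedTerms(userId); }
}
```

Creating one FakeBackendState per test run and passing it to every fake keeps the fakes consistent with each other, and discarding that object resets all of them at once.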


Note that these integration tests are not the only option. For logic in the Web UI, we can write page-level unit tests, which allow the tests to run faster and more reliably. For the terms of service feature, however, we want to test the Web UI and server logic together, so integration tests are a good solution.





This resulted in UI tests that ran, unmodified, on both the real and fake backend systems. 


When run with fake backends the tests were faster and more reliable. This made it easier to add test scenarios that would have been more challenging to set up with the real backends. We also deleted end-to-end tests that were well duplicated by the integration tests, resulting in more integration tests than end-to-end tests.



By iterating, we arrived at a sustainable test architecture for the integration tests.


If you're facing a test hourglass, the test architecture needed to devise medium tests may not be obvious. I'd recommend experimenting, dividing the system along well-defined interfaces, and making sure the new tests provide value by running faster and more reliably or by unlocking hard-to-test areas.




Testing on the Toilet: Testing UI Logic? Follow the User!

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

By Carlos Israel Ortiz García


After years of anticipation, you're finally able to purchase Google's hottest new product, gShoe*. But after clicking the "Buy" button, nothing happened! Inspecting the HTML, you notice the problem:

<button disabled="true" click="$handleBuyClick(data)">Buy</button>

Users couldn’t buy their gShoes because the “Buy” button was disabled. The problem was due to the unit test for handleBuyClick, which passed even though the user interface had a bug:

it('submits purchase request', () => {
  const controller = new PurchasePage();
  // Call the method that handles the "Buy" button click
  controller.handleBuyClick(data);
  expect(service).toHaveBeenCalledWith(expectedData);
});

In the above example, the test failed to detect the bug because it bypassed the UI element and instead directly invoked the "Buy" button click handler. To be effective, tests for UI logic should interact with the components on the page as a browser would, which allows testing the behavior that the end user experiences. Writing tests against UI components, rather than calling handlers directly, faithfully simulates user interactions (e.g., adding items to a shopping cart, clicking a purchase button, or verifying that an element is visible on the page), making the tests more comprehensive.


The test for the “Buy” button should instead exercise the entire UI component by interacting with the HTML element, which would have caught the disabled button issue:

it('submits purchase request', () => {
  // Renders the page with the "Buy" button and its associated code.
  render(PurchasePage);
  // Tries to click the button, fails the test, and catches the bug!
  buttonWithText('Buy').dispatchEvent(new Event('click'));
  expect(service).toHaveBeenCalledWith(expectedData);
});


Why should tests be written this way? Unlike end-to-end tests, tests for individual UI components don’t require a backend server or the entire app to be rendered. Instead, these tests run in the same self-contained environment and take a similar amount of time to execute as unit tests that just execute the underlying event handlers directly. Therefore, the UI acts as the public API, leaving the business logic as an implementation detail (also known as the "Use the Front Door First" principle), resulting in better coverage of a feature.

*Disclaimer: “gShoe” is not a real Google product. Unfortunately you can’t buy a pair even if the bug is fixed!

Testing on the Toilet: Avoid Hardcoding Values for Better Libraries

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

By Adel Saoud


You may have been in a situation where you're using a value that always remains the same, so you define a constant. This can be a good practice as it removes magic values and improves code readability. But be mindful that hardcoding values can make usability and potential refactoring significantly harder.

Consider the following function that relies on hardcoded values:
// Declared in the module.
constexpr int kThumbnailSizes[] = {480, 576, 720};

// Returns thumbnails of various sizes for the given image.
std::vector<Image> GetThumbnails(const Image& image) {
  std::vector<Image> thumbnails;
  for (const int size : kThumbnailSizes) {
    thumbnails.push_back(ResizeImage(image, size));
  }
  return thumbnails;
}


Using hardcoded values can make your code:
  • Less predictable: The caller might not expect the function to be relying on hardcoded values outside its parameters; a user of the function shouldn’t need to read the function’s code to know that. Also, it is difficult to predict the product/resource/performance implications of changing these hardcoded values.
  • Less reusable: The caller is not able to call the function with different values and is stuck with the hardcoded values. If the caller doesn’t need all these sizes or needs a different size, the function has to be forked or refactored to avoid the aforementioned complications with existing callers.

When designing a library, prefer to pass required values, such as through a function call or a constructor. The code above can be improved as follows:
std::vector<Image> GetThumbnails(const Image& image, absl::Span<const int> sizes) {
  std::vector<Image> thumbnails;
  for (const int size : sizes) {
    thumbnails.push_back(ResizeImage(image, size));
  }
  return thumbnails;
}


If most of the callers use the same value for a certain parameter, make your code configurable so that this value doesn't need to be duplicated by each caller. For example, you can define a public constant that contains a commonly used value, or use default arguments in languages that support this feature (e.g. C++ or Python).
// Declared in the public header.
inline constexpr int kDefaultThumbnailSizes[] = {480, 576, 720};

// Default argument allows the function to be used without specifying a size.
std::vector<Image> GetThumbnails(const Image& image,
                                 absl::Span<const int> sizes = kDefaultThumbnailSizes);