Category Archives: Google Testing Blog

If it ain’t broke, you’re not trying hard enough

Code Health: Providing Context with Commit Messages and Bug Reports

This is another post in our Code Health series. A version of this post originally appeared in Google bathrooms worldwide as a Google Testing on the Toilet episode. You can download a printer-friendly version to display in your office.

By Chris Lewis

You are caught in a trap. Mechanisms whirl around you, but they make no sense. You desperately search the room and find the builder's original plans! The description of the work order that implemented the trap reads, "Miscellaneous fixes." Oh dear.

Reading other engineers' code can sometimes feel like an archaeology expedition, full of weird and wonderful statements that are hard to decipher. Code is always written with a purpose, but sometimes that purpose is not clear in the code itself. You can address this knowledge gap by documenting the context that explains why a change was needed. Code comments provide context, but comments alone sometimes can’t provide enough.

There are two key ways to indicate context:
Commit Messages

  • A commit message is one of the easiest, most discoverable means of providing context. When you encounter lines of code that may be unclear, checking the commit message which introduced the code is a great way to gain more insight into what the code is meant to do.
  • Write the first line of the commit message so it stands alone, as tools like GitHub will display this line in commit listing pages. Stand-alone first lines allow you to skim through code history much faster, quickly building up your understanding of how a source file evolved over time. Example:  
    Add Frobber to the list of available widgets.

    This allows consumers to easily discover the new Frobber widget and
    add it to their application.
Bug Reports
  • You can use a bug report to track the entire story of a bug/feature/refactoring, adding context such as the original problem description, the design discussions between the team, and the commits that are used to solve the problem. This lets you easily see all related commits in one place, and allows others to easily keep track of the status of a particular problem.
  • Most commits should reference a bug report (see the example below). Standalone commits (e.g. one-time cleanups or other small unplanned changes) don't need their own bug report, though, since they often contain all their context within the description and the source changes.
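For example (illustrative only; the bug number and the "Bug:" trailer shown here are hypothetical and depend on your bug tracker and tooling), such a commit message might look like:
    Fix crash when the widget list is empty.

    Guard against an empty list before reading the first widget, and add a
    regression test.

    Bug: 12345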
Informative commit messages and bug reports go hand-in-hand, providing context from different perspectives. Keep in mind that such context can be useful even to yourself, providing an easy reminder about the work you did last week, last quarter, or even last year. Future you will thank past you!

Code Health: Eliminate YAGNI Smells

This is another post in our Code Health series. A version of this post originally appeared in Google bathrooms worldwide as a Google Testing on the Toilet episode. You can download a printer-friendly version to display in your office.

By Marc Eaddy

The majority of software development costs are due to maintenance. One way to reduce maintenance costs is to implement something only when you actually need it, a.k.a. the “You Aren't Gonna Need It” (YAGNI) design principle. How do you spot unnecessary code? Follow your nose!

A code smell is a code pattern that usually indicates a design flaw. For example, creating a base class or interface with only one subclass may indicate a speculation that more subclasses will be needed in the future. Instead, practice incremental development and design: don't add the second subclass until it is actually needed.

The following C++ code has many YAGNI smells:
class Mammal { ...
  virtual Status Sleep(bool hibernate) = 0;
};
class Human : public Mammal { ...
  virtual Status Sleep(bool hibernate) {
    age += hibernate ? kSevenMonths : kSevenHours;
    return OK;
  }
};

Maintainers are burdened with understanding, documenting, and testing both classes when only one is really needed. Code must handle the case when hibernate is true, even though all callers pass false, as well as the case when Sleep returns an error, even though that never happens. This results in unnecessary code that never executes. Eliminating those smells simplifies the code:

class Human { ...
  void Sleep() { age += kSevenHours; }
};

Here are some other YAGNI smells:
  • Code that has never been executed other than by tests (a.k.a. code that is dead on arrival)
  • Classes designed to be subclassed (have virtual methods and/or protected members) that are not actually subclassed
  • Public or protected methods or fields that could be private
  • Parameters, variables, or flags that always have the same value
Thankfully, YAGNI smells, and code smells in general, are often easy to spot by looking for simple patterns and are easy to eliminate using simple refactorings.

Are you thinking of adding code that won't be used today? Trust me, you aren't gonna need it!


Code Health: To Comment or Not to Comment?

This is another post in our Code Health series. A version of this post originally appeared in Google bathrooms worldwide as a Google Testing on the Toilet episode. You can download a printer-friendly version to display in your office.

By Dori Reuveni and Kevin Bourrillion

While reading code, often there is nothing more helpful than a well-placed comment. However, comments are not always good. Sometimes the need for a comment can be a sign that the code should be refactored.

Use a comment when it is infeasible to make your code self-explanatory. If you think you need a comment to explain what a piece of code does, first try one of the following:
  • Introduce an explaining variable. Instead of:
      // Subtract discount from price.
      finalPrice = (numItems * itemPrice)
          - min(5, numItems) * itemPrice * 0.1;
    Prefer:
      price = numItems * itemPrice;
      discount = min(5, numItems) * itemPrice * 0.1;
      finalPrice = price - discount;
  • Extract a method. Instead of:
      // Filter offensive words.
      for (String word : words) { ... }
    Prefer:
      filterOffensiveWords(words);
  • Use a more descriptive identifier name. Instead of:
      int width = ...; // Width in pixels.
    Prefer:
      int widthInPixels = ...;
  • Add a check in case your code has assumptions. Instead of:
      // Safe since height is always > 0.
      return width / height;
    Prefer:
      checkArgument(height > 0);
      return width / height;
There are cases where a comment can be helpful:
  • Reveal your intent: explain why the code does something (as opposed to what it does).
    // Compute once because it’s expensive.
  • Protect a well-meaning future editor from mistakenly “fixing” your code.
    // Create a new Foo instance because Foo is not thread-safe.
  • Clarification: a question that came up during code review or that readers of the code might have.
    // Note that order matters because...
  • Explain your rationale for what looks like a bad software engineering practice.
    @SuppressWarnings("unchecked") // The cast is safe because...
On the other hand, avoid comments that just repeat what the code does. These are just noise:
// Get all users.
userService.getAllUsers();
// Check if the name is empty.
if (name.isEmpty()) { ... }

Evolution of GTAC and Engineering Productivity

When Google first hosted GTAC in 2006, we didn’t know what to expect. We kicked off this conference with the intention to share our innovation in test automation, learn from others in the industry and connect with academia. Over the last decade we’ve had great participation and had the privilege to host GTAC in North America, Europe and Asia -- largely thanks to the many of you who spoke, participated and connected!

In recent months, we’ve been taking a hard look at the discipline of Engineering Productivity as a logical next step in the evolution of test automation. In that same vein, we’re going to rethink what an Engineering Productivity focused conference should look like today. As we pivot, we will be extending these changes to GTAC, and because we expect changes in theme, content, and format, we are canceling the upcoming event scheduled in London this November. We’ll be bringing the event back in 2018 with a fresh outlook and strategy.

While we know this may be disappointing for many of the folks who were looking forward to GTAC, we’re excited to come back with a new format which will serve this conference well in today’s environment.

Code Health: Too Many Comments on Your Code Reviews?

This is another post in our Code Health series. A version of this post originally appeared in Google bathrooms worldwide as a Google Testing on the Toilet episode. You can download a printer-friendly version to display in your office.

By Tom O'Neill


Code reviews can slow down an individual code change, but they’re also an opportunity to improve your code and learn from another intelligent, experienced engineer. How can you get the most out of them?
Aim to get most of your changes approved in the first round of review, with only minor comments. If your code reviews frequently require multiple rounds of comments, these tips can save you time.

Spend your reviewers’ time wisely—it’s a limited resource. If they’re catching issues that you could easily have caught yourself, you’re lowering the overall productivity of your team.
Before you send out the code review:
  • Re-evaluate your code: Don’t just send the review out as soon as the tests pass. Step back and try to rethink the whole thing—can the design be cleaned up? Especially if it’s late in the day, see if a better approach occurs to you the next morning. Although this step might slow down an individual code change, it will result in greater average throughput over the long term.
  • Consider an informal design discussion: If there’s something you’re not sure about, pair program, talk face-to-face, or send an early diff and ask for a “pre-review” of the overall design.
  • Self-review the change: Try to look at the code as critically as possible from the standpoint of someone who doesn’t know anything about it. Your code review tool can give you a radically different view of your code than the IDE. This can easily save you a round trip.
  • Make the diff easy to understand: Multiple changes at once make the code harder to review. When you self-review, look for simple changes that reduce the size of the diff. For example, save significant refactoring or formatting changes for another code review.
  • Don’t hide important info in the submit message: Put it in the code as well. Someone reading the code later is unlikely to look at the submit message.
When you’re addressing code review comments:
  • Re-evaluate your code after addressing non-trivial comments: Take a step back and really look at the code with fresh eyes. Once you’ve made one set of changes, you can often find additional improvements that are enabled or suggested by those changes. Just as with any refactoring, it may take several steps to reach the best design.
  • Understand why the reviewer made each comment: If you don’t understand the reasoning behind a comment, don’t just make the change—seek out the reviewer and learn something new.
  • Answer the reviewer’s questions in the code: Don’t just reply—make the code easier to understand (e.g., improve a variable name, change a boolean to an enum, as sketched after this list) or add a comment. Someone else is going to have the same question later on.
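As an illustration of the boolean-to-enum suggestion above, here is a minimal Python sketch with hypothetical names (not code from the original episode). An enum value is self-describing at the call site, so a reviewer's question like "what does true mean here?" never comes up:

from enum import Enum

class Notification(Enum):
    # Replaces an opaque boolean flag with a self-describing value.
    NONE = 0
    EMAIL = 1

# Before: create_user("ada", True) forces the reader to look up what True means.
# After: the intent is readable at the call site without opening the function.
def create_user(name, notification):
    if notification is Notification.EMAIL:
        print("Sending welcome email to %s" % name)

create_user("ada", Notification.EMAIL)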

Code Health: Reduce Nesting, Reduce Complexity

This is another post in our Code Health series. A version of this post originally appeared in Google bathrooms worldwide as a Google Testing on the Toilet episode. You can download a printer-friendly version to display in your office.

By Elliott Karpilovsky

Deeply nested code hurts readability and is error-prone. Try spotting the bug in the two versions of this code:

Code with too much nesting:

response = server.Call(request)

if response.GetStatus() == RPC.OK:
  if response.GetAuthorizedUser():
    if response.GetEnc() == 'utf-8':
      if response.GetRows():
        vals = [ParseRow(r) for r in
                response.GetRows()]
        avg = sum(vals) / len(vals)
        return avg, vals
      else:
        raise EmptyError()
    else:
      raise AuthError('unauthorized')
  else:
    raise ValueError('wrong encoding')
else:
  raise RpcError(response.GetStatus())

Code with less nesting:

response = server.Call(request)

if response.GetStatus() != RPC.OK:
  raise RpcError(response.GetStatus())

if not response.GetAuthorizedUser():
  raise ValueError('wrong encoding')

if response.GetEnc() != 'utf-8':
  raise AuthError('unauthorized')

if not response.GetRows():
  raise EmptyError()

vals = [ParseRow(r) for r in response.GetRows()]
avg = sum(vals) / len(vals)
return avg, vals


Answer: the "wrong encoding" and "unauthorized" errors are swapped. This bug is easier to see in the refactored version, since the checks occur right as the errors are handled.

The refactoring technique shown above is known as guard clauses. A guard clause checks a criterion and fails fast if it is not met. It decouples the computational logic from the error logic. By removing the cognitive gap between error checking and handling, it frees up mental processing power. As a result, the refactored version is much easier to read and maintain.

Here are some rules of thumb for reducing nesting in your code:
  • Keep conditional blocks short. It increases readability by keeping things local.
  • Consider refactoring when your loops and branches are more than 2 levels deep.
  • Think about moving nested logic into separate functions. For example, if you need to loop through a list of objects that each contain a list (such as a protocol buffer with repeated fields), you can define a function to process each object instead of using a double nested loop (see the sketch after this list).
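A minimal sketch of that last suggestion, using hypothetical names rather than code from the original episode: the inner loop moves into a helper function, so each function stays one level deep.

from collections import namedtuple

# Hypothetical stand-ins for a protocol buffer with repeated fields.
Item = namedtuple('Item', ['quantity'])
Order = namedtuple('Order', ['items'])

def total_quantity(order):
    # Handles a single order; the inner iteration lives here, close to the data.
    return sum(item.quantity for item in order.items)

def total_quantities(orders):
    # One level of iteration per function, instead of a double nested loop.
    return [total_quantity(order) for order in orders]

orders = [Order(items=[Item(2), Item(3)]), Order(items=[Item(5)])]
assert total_quantities(orders) == [5, 5]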
Reducing nesting results in more readable code, which leads to more easily discovered bugs, faster developer iteration, and increased stability. When you can, simplify!

GTAC Diversity Scholarship



by Lesley Katzen on behalf of the GTAC Diversity Committee


We are committed to increasing diversity at GTAC, and we believe the best way to do that is by making sure we have a diverse set of applicants to speak and attend. As part of that commitment, we are excited to announce that we will be offering travel scholarships again this year.
Travel scholarships will be available for selected applicants from traditionally underrepresented groups in technology.

To be eligible for a grant to attend GTAC, applicants must:
  • Be 18 years of age or older.
  • Be from a traditionally underrepresented group in technology.
  • Work or study in Computer Science, Computer Engineering, Information Technology, or a technical field related to software testing.
  • Be able to attend core dates of GTAC, November 14th - 15th 2017 in London, England.
To apply:
You must fill out the following scholarship form and register for GTAC to be considered for a travel scholarship.
The deadline for submission is July 1st. Scholarship recipients will be announced on August 15th. If you are selected, we will contact you with information on how to proceed with booking travel.

What the scholarship covers:
Google will pay for round-trip standard coach class airfare to London for selected scholarship recipients, and 3 nights of accommodations in a hotel near the Google King's Cross campus. Breakfast and lunch will be provided for GTAC attendees and speakers on both days of the conference. We will also provide a £75.00 gift card for other incidentals such as airport transportation or meals. You will need to provide your own credit card to cover any hotel incidentals.

Google is dedicated to providing a harassment-free and inclusive conference experience for everyone. Our anti-harassment policy can be found at:
https://www.google.com/events/policy/anti-harassmentpolicy.html

GTAC 2017 – Registration is open!

by Diego Cavalcanti on behalf of the GTAC 2017 Committee
The Google Test Automation Conference (GTAC) is an annual test automation conference hosted by Google. It brings together engineers from industry and academia to discuss advances in test automation and the computer science of test engineering. It is a great opportunity to present, learn, and challenge modern testing technologies and strategies.

We are pleased to announce that this year, GTAC will be held in Google's London office on November 14th and 15th, 2017.

Registration is currently OPEN for attendees and speakers. See more information here.

The schedule for the upcoming months is as follows:
  • May 15, 2017 - Registration opens for speakers and attendees, including applicants for the diversity scholarship.
  • July 1, 2017 - Registration closes for speaker submissions.
  • July 15, 2017 - Registration closes for attendee submissions.
  • August 15, 2017 - Selected speakers and attendees will be notified.
  • November 13, 2017 - Rehearsal day for speakers (not open for attendees).
  • November 14-15, 2017 - GTAC 2017!
As part of our efforts to increase diversity of speakers and attendees at GTAC, we will again be offering travel scholarships for selected applicants from traditionally underrepresented groups in technology. Please find more information here.

Please do not hesitate to contact gtac2017@google.com if you have any questions. We look forward to seeing you in London!

OSS-Fuzz: Five Months Later, and Rewarding Projects

By Oliver Chang, Abhishek Arya (Security Engineers, Chrome Security), Kostya Serebryany (Software Engineer, Dynamic Tools), and Josh Armour (Security Program Manager)

Five months ago, we announced OSS-Fuzz, Google's effort to help make open source software more secure and stable. Since then, our robot army has been working hard at fuzzing, processing 10 trillion test inputs a day. Thanks to the efforts of the open source community who have integrated a total of 47 projects, we've found over 1,000 bugs (264 of which are potential security vulnerabilities).
Breakdown of the types of bugs we're finding

Notable results

OSS-Fuzz has found numerous security vulnerabilities in several critical open source projects: 10 in FreeType2, 17 in FFmpeg, 33 in LibreOffice, 8 in SQLite 3, 10 in GnuTLS, 25 in PCRE2, 9 in gRPC, and 7 in Wireshark. We've also had at least one bug collision with another independent security researcher (CVE-2017-2801). (Some of the bugs are still view-restricted so links may show smaller numbers.)

Once a project is integrated into OSS-Fuzz, its continuous and automated nature means that we often catch these issues just hours after a regression is introduced into the upstream repository, which reduces the chances of users being affected.

Fuzzing not only finds memory safety related bugs, it can also find correctness or logic bugs. One example is a carry propagating bug in OpenSSL (CVE-2017-3732).

Finally, OSS-Fuzz has reported over 300 timeout and out-of-memory failures (~75% of which got fixed). Not every project treats these as bugs, but fixing them enables OSS-Fuzz to find more interesting bugs.

Announcing rewards for open source projects

We believe that user and internet security as a whole can benefit greatly if more open source projects include fuzzing in their development process. To this end, we'd like to encourage more projects to participate and adopt the ideal integration guidelines that we've established.

Combined with fixing all the issues that are found, this is often a significant amount of work for developers who may be working on an open source project in their spare time. To support these projects, we are expanding our existing Patch Rewards program to include rewards for the integration of fuzz targets into OSS-Fuzz.

To qualify for these rewards, a project needs to have a large user base and/or be critical to global IT infrastructure. Eligible projects will receive $1,000 for initial integration, and up to $20,000 for ideal integration (the final amount is at our discretion). You have the option of donating these rewards to charity instead, and Google will double the amount.

To qualify for the ideal integration reward, projects must show that:
  • Fuzz targets are checked into their upstream repository and integrated in the build system with sanitizer support (up to $5,000).
  • Fuzz targets are efficient and provide good code coverage (>80%) (up to $5,000).
  • Fuzz targets are part of the official upstream development and regression testing process, i.e. they are maintained, run against old known crashers and the periodically updated corpora (up to $5,000).
  • The last $5,000 is a "l33t" bonus that we may reward at our discretion for projects that we feel have gone the extra mile or done something really awesome.
We've already started to contact the first round of projects that are eligible for the initial reward. If you are the maintainer or point of contact for one of these projects, you may also reach out to us in order to apply for our ideal integration rewards.

The future

We'd like to thank the existing contributors who integrated their projects and fixed countless bugs. We hope to see more projects integrated into OSS-Fuzz, and greater adoption of fuzzing as standard practice when developing software.

Where do our flaky tests come from?

By Jeff Listfield

When tests fail on code that was previously tested, this is a strong signal that something is newly wrong with the code. Before, the tests passed and the code was correct; now the tests fail and the code is not working right. The goal of a good test suite is to make this signal as clear and directed as possible.

Flaky (nondeterministic) tests, however, are different. Flaky tests are tests that exhibit both a passing and a failing result with the same code. Given this, a test failure may or may not mean that there's a new problem. And trying to recreate the failure, by rerunning the test with the same version of code, may or may not result in a passing test. We start viewing these tests as unreliable and eventually they lose their value. If the root cause is nondeterminism in the production code, ignoring the test means ignoring a production bug.
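For illustration only (this example is not from the original post), here is a minimal Python sketch of what a flaky test looks like: the assertion depends on something nondeterministic rather than on the code under test, so the test sometimes passes and sometimes fails with no code change.

import random
import unittest

class FlakyExampleTest(unittest.TestCase):
    def test_server_responds_quickly(self):
        # Stand-in for a dependency whose behavior varies from run to run
        # (network latency, thread scheduling, shared test data, ...).
        simulated_latency_ms = random.gauss(45, 10)
        # Passes on most runs and fails on unlucky ones, so the result is
        # nondeterministic even though the code under test never changed.
        self.assertLess(simulated_latency_ms, 50)

if __name__ == '__main__':
    unittest.main()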
Flaky Tests at Google

Google has around 4.2 million tests that run on our continuous integration system. Of these, around 63 thousand have a flaky run over the course of a week. While this represents less than 2% of our tests, it still causes significant drag on our engineers.
If we want to fix our flaky tests (and avoid writing new ones) we need to understand them. At Google, we collect lots of data on our tests: execution times, test types, run flags, and consumed resources. I've studied how some of this data correlates with flaky tests and believe this research can lead us to better, more stable testing practices. Overwhelmingly, the larger the test (as measured by binary size, RAM use, or number of libraries built), the more likely it is to be flaky. The rest of this post will discuss some of my findings.
For a previous discussion of our flaky tests, see John Micco's post from May 2016.
Test size - Large tests are more likely to be flaky

We categorize our tests into three general sizes: small, medium and large. Every test has a size, but the choice of label is subjective. The engineer chooses the size when they initially write the test, and the size is not always updated as the test changes. For some tests it doesn't reflect the nature of the test anymore. Nonetheless, it has some predictive value. Over the course of a week, 0.5% of our small tests were flaky, 1.6% of our medium tests were flaky, and 14% of our large tests were flaky [1]. There's a clear increase in flakiness from small to medium and from medium to large. But this still leaves open a lot of questions. There's only so much we can learn looking at three sizes.
The larger the test, the more likely it will be flaky

There are some objective measures of size we collect: test binary size and RAM used when running the test [2]. For these two metrics, I grouped tests into equal-sized buckets [3] and calculated the percentage of tests in each bucket that were flaky. The numbers below are the r2 values of the linear best fit [4].

Correlation between metric and likelihood of test being flaky

  Metric         r2
  Binary size    0.82
  RAM used       0.76
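
To make the bucketing approach concrete, here is a minimal sketch (my illustration with synthetic data, not the code used for this study) of grouping a size metric into roughly 100 equal-width buckets, computing the flaky percentage per bucket, and taking the r2 of a linear best fit. It assumes NumPy and SciPy, and per-test arrays holding a size metric and a flaky/not-flaky label.

import numpy as np
from scipy import stats

# Hypothetical inputs: one entry per test.
metric = np.random.lognormal(mean=3.0, sigma=1.0, size=100000)  # e.g. binary size
is_flaky = np.random.rand(100000) < np.clip(metric / metric.max(), 0, 0.3)

# Group tests into ~100 equal-width buckets of the metric.
edges = np.linspace(metric.min(), metric.max(), num=101)
bucket = np.clip(np.digitize(metric, edges) - 1, 0, 99)

# Percentage of flaky tests per bucket (skipping empty buckets).
xs, ys = [], []
for b in range(100):
    mask = bucket == b
    if mask.any():
        xs.append(edges[b])
        ys.append(100.0 * is_flaky[mask].mean())

# Linear best fit and its r2.
fit = stats.linregress(xs, ys)
print("r2 = %.2f" % fit.rvalue ** 2)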


The tests that I'm looking at are (for the most part) hermetic tests that provide a pass/fail signal. Binary size and RAM use correlated quite well when looking across our tests and there's not much difference between them. So it's not just that large tests are likely to be flaky, it's that the larger the tests get, the more likely they are to be flaky.

I have charted the full set of tests below for those two metrics. Flakiness increases with increases in binary size [5], but we also see increasing linear fit residuals [6] at larger sizes.


The RAM use chart below has a clearer progression and only starts showing large residuals between the first and second vertical lines.



While the bucket sizes are constant, the number of tests in each bucket is different. The points on the right with larger residuals include far fewer tests than those on the left. If I take the smallest 96% of our tests (which ends just past the first vertical line) and then shrink the bucket size, I get a much stronger correlation (r2 is 0.94). This perhaps indicates that RAM and binary size are much better predictors than the overall charts show.



Certain tools correlate with a higher rate of flaky tests
Some tools get blamed for being the cause of flaky tests. For example, WebDriver tests (whether written in Java, Python, or JavaScript) have a reputation for being flaky [7]. For a few of our common testing tools, I determined the percentage of all the tests written with that tool that were flaky. Of note, all of these tools tend to be used with our larger tests. This is not an exhaustive list of all our testing tools, and represents around a third of our overall tests. The remainder of the tests use less common tools or have no readily identifiable tool.

Flakiness of tests using some of our common testing tools

  Category                       % of tests that are flaky   % of all flaky tests
  All tests                      1.65%                       100%
  Java WebDriver                 10.45%                      20.3%
  Python WebDriver               18.72%                      4.0%
  An internal integration tool   14.94%                      10.6%
  Android emulator               25.46%                      11.9%


All of these tools have higher than average flakiness. And given that 1 in 5 of our flaky tests are Java WebDriver tests, I can understand why people complain about them. But correlation is not causation, and given our results from the previous section, there might be something other than the tool causing the increased rate of flakiness.
Size is more predictive than tool

We can combine tool choice and test size to see which is more important. For each tool above, I isolated tests that use the tool and bucketed those based on memory usage (RAM) and binary size, similar to my previous approach. I calculated the line of best fit and how well it correlated with the data (r2). I then computed the predicted likelihood a test would be flaky at the smallest bucket [8] (which is already the 48th percentile of all our tests) as well as the 90th and 95th percentile of RAM used.
Predicted flaky likelihood by RAM and tool

  Category                       r2     Smallest bucket      90th percentile   95th percentile
                                        (48th percentile)
  All tests                      0.76   1.5%                 5.3%              9.2%
  Java WebDriver                 0.70   2.6%                 6.8%              11%
  Python WebDriver               0.65   -2.0%                2.4%              6.8%
  An internal integration tool   0.80   -1.9%                3.1%              8.1%
  Android emulator               0.45   7.1%                 12%               17%


This table shows the results of these calculations for RAM. The correlation is stronger for the tools other than the Android emulator. If we ignore that tool, the difference in predicted flakiness between tools at similar RAM use is around 4-5%. The difference from the smallest tests to the 95th percentile is 8-10%. This is one of the most useful outcomes from this research: tools have some impact, but RAM use accounts for larger deviations in flakiness.
Predicted flaky likelihood by binary size and tool

  Category                       r2     Smallest bucket      90th percentile   95th percentile
                                        (33rd percentile)
  All tests                      0.82   -4.4%                4.5%              9.0%
  Java WebDriver                 0.81   -0.7%                14%               21%
  Python WebDriver               0.61   -0.9%                11%               17%
  An internal integration tool   0.80   -1.8%                10%               17%
  Android emulator               0.05   18%                  23%               25%


There's virtually no correlation between binary size and flakiness for Android emulator tests. For the other tools, you see greater variation in predicted flakiness between the small tests and the large tests than you do with RAM: up to 12 percentage points. But you also see wider differences from the smallest size to the largest: 22 percentage points at the maximum. This is similar to what we saw with RAM use, and it is another of the most useful outcomes of this research: binary size accounts for larger deviations in flakiness than the tool you use.
Conclusions

Engineer-selected test size correlates with flakiness, but within Google there are not enough test size options to be particularly useful.
Objectively measured test binary size and RAM have strong correlations with whether a test is flaky. This is a continuous function rather than a step function. A step function would have sudden jumps and could indicate that we're transitioning from one type of test to another at those points (e.g. unit tests to system tests or system tests to integration tests).
Tests written with certain tools exhibit a higher rate of flakiness. But much of that can be explained by the generally larger size of these tests. The tool itself seems to contribute only a small amount to this difference.
We need to be more careful before we decide to write large tests. Think about what code you are testing and what a minimal test would look like. And we need to be careful as we write large tests. Without additional effort aimed at preventing flakiness, there is a strong likelihood you will have flaky tests that require maintenance.
Footnotes
  1. A test was flaky if it had at least one flaky run during the week.
  2. I also considered number of libraries built to create the test. In a 1% sample of tests, binary size (0.39) and RAM use (0.34) had stronger correlations than number of libraries (0.27). I only studied binary size and RAM use moving forward.
  3. I aimed for around 100 buckets for each metric.
  4. r2 measures how closely the line of best fit matches the data. A value of 1 means the line matches the data exactly.
  5. There are two interesting areas where the points actually reverse their upward slope. The first starts about halfway to the first vertical line and lasts for a few data points, and the second goes from right before the first vertical line to right after. The sample size is large enough here that it's unlikely to just be random noise. There are clumps of tests around these points that are more or less flaky than I'd expect only considering binary size. This is an opportunity for further study.
  6. Distance from the observed point and the line of best fit.
  7. Other web testing tools get blamed as well, but WebDriver is our most commonly used one.
  8. Some of the predicted flakiness percentages for the smallest buckets end up being negative. While we can't have a negative percentage of tests be flaky, it is a possible outcome when using this type of prediction.