Category Archives: Google Testing Blog

If it ain’t broke, you’re not trying hard enough

GTAC Diversity Scholarship



by Lesley Katzen on behalf of the GTAC Diversity Committee


We are committed to increasing diversity at GTAC, and we believe the best way to do that is by making sure we have a diverse set of applicants to speak and attend. As part of that commitment, we are excited to announce that we will be offering travel scholarships again this year.
Travel scholarships will be available for selected applicants from traditionally underrepresented groups in technology.

To be eligible for a grant to attend GTAC, applicants must:
  • Be 18 years of age or older.
  • Be from a traditionally underrepresented group in technology.
  • Work or study in Computer Science, Computer Engineering, Information Technology, or a technical field related to software testing.
  • Be able to attend the core dates of GTAC, November 14th-15th, 2017, in London, England.
To apply:
You must fill out the following scholarship form and register for GTAC to be considered for a travel scholarship.
The deadline for submission is July 1st. Scholarship recipients will be announced on August 15th. If you are selected, we will contact you with information on how to proceed with booking travel.

What the scholarship covers:
Google will pay for round-trip standard coach class airfare to London for selected scholarship recipients, and 3 nights of accommodations in a hotel near the Google King's Cross campus. Breakfast and lunch will be provided for GTAC attendees and speakers on both days of the conference. We will also provide a £75.00 gift card for other incidentals such as airport transportation or meals. You will need to provide your own credit card to cover any hotel incidentals.

Google is dedicated to providing a harassment-free and inclusive conference experience for everyone. Our anti-harassment policy can be found at:
https://www.google.com/events/policy/anti-harassmentpolicy.html

GTAC 2017 – Registration is open!

by Diego Cavalcanti on behalf of the GTAC 2017 Committee
The Google Test Automation Conference (GTAC) is an annual test automation conference hosted by Google. It brings together engineers from industry and academia to discuss advances in test automation and the test engineering computer science field. It is a great opportunity to present, learn, and challenge modern testing technologies and strategies.

We are pleased to announce that this year, GTAC will be held in Google's London office on November 14th and 15th, 2017.

Registration is currently OPEN for attendees and speakers. See more information here.

The schedule for the upcoming months is as follows:
  • May 15, 2017 - Registration opens for speakers and attendees, including applicants for the diversity scholarship.
  • July 1, 2017 - Registration closes for speaker submissions.
  • July 15, 2017 - Registration closes for attendee submissions.
  • August 15, 2017 - Selected speakers and attendees will be notified.
  • November 13, 2017 - Rehearsal day for speakers (not open for attendees).
  • November 14-15, 2017 - GTAC 2017!
As part of our efforts to increase diversity of speakers and attendees at GTAC, we will again be offering travel scholarships for selected applicants from traditionally underrepresented groups in technology. Please find more information here.

Please do not hesitate to contact gtac2017@google.com if you have any questions. We look forward to seeing you in London!

OSS-Fuzz: Five Months Later, and Rewarding Projects

By Oliver Chang, Abhishek Arya (Security Engineers, Chrome Security), Kostya Serebryany (Software Engineer, Dynamic Tools), and Josh Armour (Security Program Manager)

Five months ago, we announced OSS-Fuzz, Google's effort to help make open source software more secure and stable. Since then, our robot army has been working hard at fuzzing, processing 10 trillion test inputs a day. Thanks to the efforts of the open source community who have integrated a total of 47 projects, we've found over 1,000 bugs (264 of which are potential security vulnerabilities).
Breakdown of the types of bugs we're finding

Notable results

OSS-Fuzz has found numerous security vulnerabilities in several critical open source projects: 10 in FreeType2, 17 in FFmpeg, 33 in LibreOffice, 8 in SQLite 3, 10 in GnuTLS, 25 in PCRE2, 9 in gRPC, and 7 in Wireshark. We've also had at least one bug collision with another independent security researcher (CVE-2017-2801). (Some of the bugs are still view-restricted so links may show smaller numbers.)

Once a project is integrated into OSS-Fuzz, the continuous and automated nature of OSS-Fuzz means that we often catch these issues just hours after the regression is introduced into the upstream repository, so that the chance of users being affected is reduced.

Fuzzing not only finds memory safety related bugs, it can also find correctness or logic bugs. One example is a carry propagating bug in OpenSSL (CVE-2017-3732).
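
To illustrate how randomized inputs can expose a logic bug rather than a memory-safety bug, here is a small sketch of differential testing: a hand-written multi-limb addition is compared against java.math.BigInteger on random inputs. This is a plain random-input loop, not OSS-Fuzz or a coverage-guided fuzzer. The LimbAdder and DifferentialCheck names are hypothetical, and the code is unrelated to the actual OpenSSL implementation.

```java
import java.math.BigInteger;
import java.util.Random;

// Hypothetical multi-precision adder used only for illustration.
final class LimbAdder {
  // Adds two little-endian arrays of unsigned 32-bit limbs of equal length.
  // The result has one extra limb to hold the final carry.
  static int[] add(int[] a, int[] b) {
    int[] sum = new int[a.length + 1];
    long carry = 0;
    for (int i = 0; i < a.length; i++) {
      long t = (a[i] & 0xFFFFFFFFL) + (b[i] & 0xFFFFFFFFL) + carry;
      sum[i] = (int) t;  // keep the low 32 bits
      carry = t >>> 32;  // propagate the carry into the next limb
    }
    sum[a.length] = (int) carry;
    return sum;
  }

  // Converts little-endian unsigned limbs to a non-negative BigInteger.
  static BigInteger toBigInteger(int[] limbs) {
    BigInteger value = BigInteger.ZERO;
    for (int i = limbs.length - 1; i >= 0; i--) {
      value = value.shiftLeft(32).or(BigInteger.valueOf(limbs[i] & 0xFFFFFFFFL));
    }
    return value;
  }
}

final class DifferentialCheck {
  // Returns the number of random inputs on which LimbAdder disagrees with BigInteger.
  static int countMismatches(long seed, int iterations, int limbCount) {
    Random random = new Random(seed);
    int mismatches = 0;
    for (int n = 0; n < iterations; n++) {
      int[] a = new int[limbCount];
      int[] b = new int[limbCount];
      for (int i = 0; i < limbCount; i++) {
        a[i] = random.nextInt();
        b[i] = random.nextInt();
      }
      BigInteger expected = LimbAdder.toBigInteger(a).add(LimbAdder.toBigInteger(b));
      BigInteger actual = LimbAdder.toBigInteger(LimbAdder.add(a, b));
      if (!expected.equals(actual)) {
        mismatches++;
      }
    }
    return mismatches;
  }
}
```

With the carry term in place, countMismatches should report zero. Removing the carry from add makes the comparison fail on many random inputs, which is the kind of correctness signal a reference implementation can give a fuzzer.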

Finally, OSS-Fuzz has reported over 300 timeout and out-of-memory failures (~75% of which got fixed). Not every project treats these as bugs, but fixing them enables OSS-Fuzz to find more interesting bugs.

Announcing rewards for open source projects

We believe that user and internet security as a whole can benefit greatly if more open source projects include fuzzing in their development process. To this end, we'd like to encourage more projects to participate and adopt the ideal integration guidelines that we've established.

Combined with fixing all the issues that are found, this is often a significant amount of work for developers who may be working on an open source project in their spare time. To support these projects, we are expanding our existing Patch Rewards program to include rewards for the integration of fuzz targets into OSS-Fuzz.

To qualify for these rewards, a project needs to have a large user base and/or be critical to global IT infrastructure. Eligible projects will receive $1,000 for initial integration, and up to $20,000 for ideal integration (the final amount is at our discretion). You have the option of donating these rewards to charity instead, and Google will double the amount.

To qualify for the ideal integration reward, projects must show that:
  • Fuzz targets are checked into their upstream repository and integrated in the build system with sanitizer support (up to $5,000).
  • Fuzz targets are efficient and provide good code coverage (>80%) (up to $5,000).
  • Fuzz targets are part of the official upstream development and regression testing process, i.e. they are maintained, run against old known crashers and the periodically updated corpora (up to $5,000).
  • The last $5,000 is a "l33t" bonus that we may reward at our discretion for projects that we feel have gone the extra mile or done something really awesome.
We've already started to contact the first round of projects that are eligible for the initial reward. If you are the maintainer or point of contact for one of these projects, you may also reach out to us in order to apply for our ideal integration rewards.

The future

We'd like to thank the existing contributors who integrated their projects and fixed countless bugs. We hope to see more projects integrated into OSS-Fuzz, and greater adoption of fuzzing as standard practice when developing software.

Where do our flaky tests come from?

author: Jeff Listfield

When tests fail on code that was previously tested, this is a strong signal that something is newly wrong with the code. Before, the tests passed and the code was correct; now the tests fail and the code is not working right. The goal of a good test suite is to make this signal as clear and directed as possible.

Flaky (nondeterministic) tests, however, are different. Flaky tests are tests that exhibit both a passing and a failing result with the same code. Given this, a test failure may or may not mean that there's a new problem. And trying to recreate the failure, by rerunning the test with the same version of code, may or may not result in a passing test. We start viewing these tests as unreliable and eventually they lose their value. If the root cause is nondeterminism in the production code, ignoring the test means ignoring a production bug.
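
The definition above can be made concrete with a short sketch. It assumes a hypothetical log of test runs (test name, code version, pass or fail) and reports a test as flaky when the same code version produced both a passing and a failing result. The TestRun and FlakyDetector names are illustrative only and are not part of any Google tooling.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical record of one test execution; not a real Google data model.
final class TestRun {
  final String testName;
  final String codeVersion;
  final boolean passed;

  TestRun(String testName, String codeVersion, boolean passed) {
    this.testName = testName;
    this.codeVersion = codeVersion;
    this.passed = passed;
  }
}

final class FlakyDetector {
  // Returns the names of tests that both passed and failed at the same code version.
  static Set<String> findFlakyTests(List<TestRun> runs) {
    // Key: (test name, code version); value: the distinct outcomes seen so far.
    Map<List<String>, Set<Boolean>> outcomes = new HashMap<>();
    Set<String> flaky = new TreeSet<>();
    for (TestRun run : runs) {
      List<String> key = Arrays.asList(run.testName, run.codeVersion);
      Set<Boolean> seen = outcomes.computeIfAbsent(key, k -> new HashSet<>());
      seen.add(run.passed);
      if (seen.size() == 2) {
        flaky.add(run.testName);  // both a pass and a fail on identical code
      }
    }
    return flaky;
  }
}
```

A test that passes at one version and fails at the next is not reported here, since that pattern is consistent with a real regression rather than nondeterminism.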
Flaky Tests at Google

Google has around 4.2 million tests that run on our continuous integration system. Of these, around 63 thousand have a flaky run over the course of a week. While this represents less than 2% of our tests, it still causes significant drag on our engineers.
If we want to fix our flaky tests (and avoid writing new ones) we need to understand them. At Google, we collect lots of data on our tests: execution times, test types, run flags, and consumed resources. I've studied how some of this data correlates with flaky tests and believe this research can lead us to better, more stable testing practices. Overwhelmingly, the larger the test (as measured by binary size, RAM use, or number of libraries built), the more likely it is to be flaky. The rest of this post will discuss some of my findings.
For a previous discussion of our flaky tests, see John Micco's post from May 2016.
Test size - Large tests are more likely to be flaky

We categorize our tests into three general sizes: small, medium and large. Every test has a size, but the choice of label is subjective. The engineer chooses the size when they initially write the test, and the size is not always updated as the test changes. For some tests it doesn't reflect the nature of the test anymore. Nonetheless, it has some predictive value. Over the course of a week, 0.5% of our small tests were flaky, 1.6% of our medium tests were flaky, and 14% of our large tests were flaky [1]. There's a clear increase in flakiness from small to medium and from medium to large. But this still leaves open a lot of questions. There's only so much we can learn looking at three sizes.
The larger the test, the more likely it will be flaky

There are some objective measures of size we collect: test binary size and RAM used when running the test [2]. For these two metrics, I grouped tests into equal-sized buckets [3] and calculated the percentage of tests in each bucket that were flaky. The numbers below are the r² values of the linear best fit [4].

Correlation between metric and likelihood of test being flaky
Metric        r²
Binary size   0.82
RAM used      0.76


The tests that I'm looking at are (for the most part) hermetic tests that provide a pass/fail signal. Binary size and RAM use correlated quite well when looking across our tests and there's not much difference between them. So it's not just that large tests are likely to be flaky, it's that the larger the tests get, the more likely they are to be flaky.
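
As a rough illustration of the analysis described above, the sketch below groups tests into equal-width buckets by a size metric, computes the fraction of flaky tests in each bucket, and fits a least-squares line to obtain r². The class name, method names, and inputs are assumptions made for illustration; the post does not describe the tooling actually used.

```java
import java.util.Arrays;

// Hypothetical helper for bucketing tests by size and fitting a line.
final class FlakinessFit {
  // Fraction of flaky tests per equal-width bucket of the size metric.
  // Buckets that contain no tests are reported as NaN so callers can skip them.
  static double[] flakyRatePerBucket(double[] sizes, boolean[] flaky, int bucketCount) {
    double min = Arrays.stream(sizes).min().getAsDouble();
    double max = Arrays.stream(sizes).max().getAsDouble();
    double width = (max - min) / bucketCount;
    int[] total = new int[bucketCount];
    int[] flakyCount = new int[bucketCount];
    for (int i = 0; i < sizes.length; i++) {
      // The maximum value sits on the upper edge, so clamp it into the last bucket.
      int b = width == 0 ? 0 : Math.min(bucketCount - 1, (int) ((sizes[i] - min) / width));
      total[b]++;
      if (flaky[i]) {
        flakyCount[b]++;
      }
    }
    double[] rate = new double[bucketCount];
    for (int b = 0; b < bucketCount; b++) {
      rate[b] = total[b] == 0 ? Double.NaN : (double) flakyCount[b] / total[b];
    }
    return rate;
  }

  // Returns {slope, intercept, rSquared} of the least-squares line through (x[i], y[i]).
  // Both x and y must contain at least two distinct values.
  static double[] linearFit(double[] x, double[] y) {
    double meanX = Arrays.stream(x).average().getAsDouble();
    double meanY = Arrays.stream(y).average().getAsDouble();
    double sxx = 0;
    double sxy = 0;
    double syy = 0;
    for (int i = 0; i < x.length; i++) {
      double dx = x[i] - meanX;
      double dy = y[i] - meanY;
      sxx += dx * dx;
      sxy += dx * dy;
      syy += dy * dy;
    }
    double slope = sxy / sxx;
    double intercept = meanY - slope * meanX;
    double rSquared = (sxy * sxy) / (sxx * syy);
    return new double[] {slope, intercept, rSquared};
  }
}
```

A predicted flaky rate at a given size is then slope * size + intercept. Because the line is extrapolated, that prediction can fall below zero for small sizes, as footnote 8 notes.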

I have charted the full set of tests below for those two metrics. Flakiness increases with increases in binary size [5], but we also see increasing linear fit residuals [6] at larger sizes.


The RAM use chart below has a clearer progression and only starts showing large residuals between the first and second vertical lines.



While the bucket sizes are constant, the number of tests in each bucket is different. The points on the right with larger residuals include far fewer tests than those on the left. If I take the smallest 96% of our tests (which ends just past the first vertical line) and then shrink the bucket size, I get a much stronger correlation (r² is 0.94). It perhaps indicates that RAM and binary size are much better predictors than the overall charts show.



Certain tools correlate with a higher rate of flaky tests
Some tools get blamed for being the cause of flaky tests. For example, WebDriver tests (whether written in Java, Python, or JavaScript) have a reputation for being flaky [7]. For a few of our common testing tools, I determined the percentage of all the tests written with that tool that were flaky. Of note, all of these tools tend to be used with our larger tests. This is not an exhaustive list of all our testing tools, and represents around a third of our overall tests. The remainder of the tests use less common tools or have no readily identifiable tool.

Flakiness of tests using some of our common testing tools
Category                       % of tests that are flaky   % of all flaky tests
All tests                      1.65%                       100%
Java WebDriver                 10.45%                      20.3%
Python WebDriver               18.72%                      4.0%
An internal integration tool   14.94%                      10.6%
Android emulator               25.46%                      11.9%


All of these tools have higher than average flakiness. And given that 1 in 5 of our flaky tests are Java WebDriver tests, I can understand why people complain about them. But correlation is not causation, and given our results from the previous section, there might be something other than the tool causing the increased rate of flakiness.
Size is more predictive than tool

We can combine tool choice and test size to see which is more important. For each tool above, I isolated tests that use the tool and bucketed those based on memory usage (RAM) and binary size, similar to my previous approach. I calculated the line of best fit and how well it correlated with the data (r²). I then computed the predicted likelihood a test would be flaky at the smallest bucket [8] (which is already the 48th percentile of all our tests) as well as the 90th and 95th percentile of RAM used.
Predicted flaky likelihood by RAM and tool
Category                       r²     Smallest bucket (48th percentile)   90th percentile   95th percentile
All tests                      0.76   1.5%                                5.3%              9.2%
Java WebDriver                 0.70   2.6%                                6.8%              11%
Python WebDriver               0.65   -2.0%                               2.4%              6.8%
An internal integration tool   0.80   -1.9%                               3.1%              8.1%
Android emulator               0.45   7.1%                                12%               17%


This table shows the results of these calculations for RAM. The correlation is stronger for the tools other than Android emulator. If we ignore that tool, the difference in predicted flakiness between tools for similar RAM use is around 4-5 percentage points. The difference from the smallest bucket to the 95th percentile within each tool is 8-10 percentage points. This is one of the most useful outcomes from this research: tools have some impact, but RAM use accounts for larger deviations in flakiness.
Predicted flaky likelihood by binary size and tool

Category                       r²     Smallest bucket (33rd percentile)   90th percentile   95th percentile
All tests                      0.82   -4.4%                               4.5%              9.0%
Java WebDriver                 0.81   -0.7%                               14%               21%
Python WebDriver               0.61   -0.9%                               11%               17%
An internal integration tool   0.80   -1.8%                               10%               17%
Android emulator               0.05   18%                                 23%               25%


There's virtually no correlation between binary size and flakiness for Android emulator tests. For the other tools, you see greater variation in predicted flakiness between the small tests and large tests compared to RAM: up to 12 percentage points. But you also see wider differences from the smallest size to the largest: 22 percentage points at the max. This is similar to what we saw with RAM use and is another of the most useful outcomes of this research: binary size accounts for larger deviations in flakiness than the tool you use.
Conclusions

Engineer-selected test size correlates with flakiness, but within Google there are not enough test size options to be particularly useful.
Objectively measured test binary size and RAM have strong correlations with whether a test is flaky. This is a continuous function rather than a step function. A step function would have sudden jumps and could indicate that we're transitioning from one type of test to another at those points (e.g. unit tests to system tests or system tests to integration tests).
Tests written with certain tools exhibit a higher rate of flakiness. But much of that can be explained by the generally larger size of these tests. The tool itself seems to contribute only a small amount to this difference.
We need to be more careful before we decide to write large tests. Think about what code you are testing and what a minimal test would look like. And we need to be careful as we write large tests. Without additional effort aimed at preventing flakiness, there is a strong likelihood you will have flaky tests that require maintenance.
Footnotes
  1. A test was flaky if it had at least one flaky run during the week.
  2. I also considered number of libraries built to create the test. In a 1% sample of tests, binary size (0.39) and RAM use (0.34) had stronger correlations than number of libraries (0.27). I only studied binary size and RAM use moving forward.
  3. I aimed for around 100 buckets for each metric.
  4. r² measures how closely the line of best fit matches the data. A value of 1 means the line matches the data exactly.
  5. There are two interesting areas where the points actually reverse their upward slope. The first starts about halfway to the first vertical line and lasts for a few data points and the second goes from right before the first vertical line to right after. The sample size is large enough here that it's unlikely to just be random noise. There are clumps of tests around these points that are more or less flaky than I'd expect only considering binary size. This is an opportunity for further study.
  6. The distance between the observed point and the line of best fit.
  7. Other web testing tools get blamed as well, but WebDriver is our most commonly used one.
  8. Some of the predicted flakiness percentages for the smallest buckets end up being negative. While we can't have a negative percentage of tests be flaky, it is a possible outcome of this type of linear prediction.

Code Health: Google’s Internal Code Quality Efforts

By Max Kanat-Alexander, Tech Lead for Code Health and Author of Code Simplicity

There are many aspects of good coding practices that don't fall under the normal areas of testing and tooling that most Engineering Productivity groups focus on in the software industry. For example, having readable and maintainable code is about more than just writing good tests or having the right tools—it's about having code that can be easily understood and modified in the first place. But how do you make sure that engineers follow these practices while still allowing them the independence that they need to make sound engineering decisions?

Many years ago, a group of Googlers came together to work on this problem, and they called themselves the "Code Health" group. Why "Code Health"? Well, many of the other terms used for this in the industry—engineering productivity, best practices, coding standards, code quality—have connotations that could lead somebody to think we were working on something other than what we wanted to focus on. What we cared about was the processes and practices of software engineering in full—any aspect of how software was written that could influence the readability, maintainability, stability, or simplicity of code. We liked the analogy of having "healthy" code as covering all of these areas.

This is a field that many authors, theorists, and conference speakers touch on, but not an area that usually has dedicated resources within engineering organizations. Instead, in most software companies, these efforts are pushed by a few dedicated engineers in their extra time or led by the senior tech leads. However, every software engineer is actually involved in code health in some way. After all, we all write software, and most of us care deeply about doing it the "right way." So why not start a group that helps engineers with that "right way" of doing things?

This isn't to say that we are prescriptive about engineering practices at Google. We still let engineers make the decisions that are most sensible for their projects. What the Code Health group does is work on efforts that universally improve the lives of engineers and their ability to write products with shorter iteration time, decreased development effort, greater stability, and improved performance. Everybody appreciates their code getting easier to understand, their libraries getting simpler, etc. because we all know those things let us move faster and make better products.

But how do we accomplish all of this? Well, at Google, Code Health efforts come in many forms.

There is a Google-wide Code Health Group composed of 20% contributors who work to make engineering at Google better for everyone. The members of this group maintain internal documents on best practices and act as a sounding board for teams and individuals who wonder how best to improve practices in their area. Once in a while, for critical projects, members of the group get directly involved in refactoring code, improving libraries, or making changes to tools that promote code health.

For example, this central group maintains Google's code review guidelines, writes internal publications about best practices, organizes tech talks on productivity improvements, and generally fosters a culture of great software engineering at Google.

Some of the senior members of the Code Health group also advise engineering executives and internal leadership groups on how to improve engineering practices in their areas. It's not always clear how to implement effective code health practices in an area—some people have more experience than others making this happen broadly in teams, and so we offer our consulting and experience to help make simple code and great developer experiences a reality.

In addition to the central group, many products and teams at Google have their own Code Health group. These groups tend to work more closely on actual coding projects, such as addressing technical debt through refactoring, making tools that detect and prevent bad coding practices, creating automated code formatters, or making systems for automatically deleting unused code. Usually these groups coordinate and meet with the central Code Health group to make sure that we aren't duplicating efforts across the company and so that great new tools and systems can be shared with the rest of Google.

Throughout the years, Google's Code Health teams have had a major impact on the ability of engineers to develop great products quickly at Google. But code complexity isn't an issue that only affects Google—it affects everybody who writes software, from one person writing software on their own time to the largest engineering teams in the world. So in order to help out everybody, we're planning to release articles in the coming weeks and months that detail specific practices that we encourage internally—practices that can be applied everywhere to help your company, your codebase, your team, and you. Stay tuned here on the Google Testing Blog for more Code Health articles coming soon!

Discomfort as a Tool for Change

by Dave Gladfelter (SETI, Google Drive)

Introduction

The SETI (Software Engineer, Tools and Infrastructure) role at Google is a strange one in that there's no obvious reason why it should exist. The SWEs (Software Engineers) on a project understand its problems best, and understanding a problem is most of the way to fixing it. How can SETIs bring unique value to a project when SWEs have more on-the-ground experience with their impediments?

The answer is scope. A SWE is rewarded for being an expert in their particular area and domain and is highly motivated to make optimizations to their carved-out space. SETIs (and Test Engineers and EngProd in general) identify and solve product-wide problems.

Product-wide problems frequently arise because local optimizations don't necessarily add up to product-wide optimizations. The reason may be the limits of attention, blind spots, or mis-aligned incentives, but a group of SWEs each optimizing for their own sub-projects will not achieve product-wide maxima.

Often SETIs and Test Engineers (TEs) know what behavior they'd like to see, such as more integration tests. We may even have management's ear and convince them to mandate such tests. However, in the absence of incentives, it's unlikely that the decisions SWEs make in response to such mandates will add up to the behavior we desire. Mandates around methods/practices are often ineffective. For example, a mandate of documentation for each public method on an interface often results in "method foo does foo."

The best way to create product-wide efficiencies is to change the way the team or process works in ways that will (initially) be uncomfortable for the engineering team, but that pay dividends that can't be achieved any other way. SETIs and TEs must work to identify the blind spots and negative interactions between engineering teams and change the environment in ways that align engineering teams' incentives. When properly incentivized, SWEs will make optimal decisions enhanced by product-wide vision rather than micro-management.

Common Product-Wide Problems

Hard-to-use APIs

One common example of local optimizations resulting in cross-team de-optimization is documentation and ease-of-use of internal APIs. The team that implements an internal API is not rewarded for making it easy to use except in the most oblique ways. Clients are compelled to use the internal APIs provided to them, so the API owner has a monopoly and will set the price of using it at "you must read all the code and debug it yourself" in the absence of incentives or (rare) heroes.

Big, slow releases

Another example is large and slow releases. Without EngProd help or external pressure, teams will gravitate to the slowest, biggest release possible.

This makes sense from the position of any individual SWE: releases are painful, you have to ensure that there are no UI and API regressions, watch traffic and error rates for some time, and re-learn and use tools and processes that are complex and specific to releases.

Multiple teams will naturally gravitate to having one big release so that all of these costs can be bundled into one operation for "efficiency." The result is that engineers don't get feedback on features for weeks and versioning of APIs and data stores is ignored (since all the parts of the system are bundled together into one big release). This greatly slows down developer and feature velocity and greatly increases risks of cascading failures when the release fails.

How EngProd fixes product-wide problems

SETIs can nibble around the edges of these kinds of problems by writing tools and automation. TEs can create easy-to-use test environments that facilitate isolating and debugging faults in integration and ambiguities in APIs. We can use fancy technologies to sample live traffic and ensure that new versions of systems behave the same as previous versions. We can review design docs to ensure that they have an appropriate test plan. Often these actions do have real value. However, these are not the best way to align incentives to create a product-wide solution. Facilitating engineering teams' fruitful collaboration (and dis-incentivizing negative interactions) gives EngProd a multiplier that is hard to achieve with only tooling and automation.

Heroes are few and far between so we must turn to incentives, which is where discomfort comes in. Continuity is comfortable and change is painful. EngProd looks at how to change the problem so that teams are incentivized to work together fruitfully and disincentivized (discomforted) to pursue local optimizations exclusively.

So how does EngProd align incentives? Certainly there is a place for optimizing for optimal behaviors, such as easy-to-use integration environments. However, incentivizing optimal behaviors via negative feedback should not be overlooked. Each problem is different, so let's look at how to address the two examples above:

Incentivizing easy-to-use APIs

Engineers will make the things they're incentivized to make. For APIs, make teams incentivized to provide integration help in the form of fakes. EngProd works with team leads to ensure there are explicit objectives to provide Fakes for their APIs as part of the rollout.

Fakes are as-simple-as-possible implementations of a service that still can be used to do pre-submit testing of client interactions with the system. They don't replace integration tests, but they reduce the likelihood of finding errors in subsequent integration test runs by an order of magnitude.
Furthermore, have some subset of the same client-owned and server-owned tests run against the fakes (for quick presubmit testing) as well as the real implementation (for continuous integration testing) and work with management to make it the responsibility of the Fake owner to debug any discrepancies for either the client- or the server-owned tests.
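
A minimal sketch of this arrangement follows. It assumes a hypothetical KeyValueStore API; the interface, the FakeKeyValueStore class, and the StoreContract check are invented for illustration and do not correspond to any real Google service. The contract check is written once against the interface so that the same assertions can run against the fake at presubmit and against the real implementation in continuous integration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical service API owned by the server team.
interface KeyValueStore {
  void put(String key, String value);
  Optional<String> get(String key);
  void delete(String key);
}

// As-simple-as-possible in-memory fake: no network and no persistence.
final class FakeKeyValueStore implements KeyValueStore {
  private final Map<String, String> data = new HashMap<>();

  @Override public void put(String key, String value) {
    data.put(key, value);
  }

  @Override public Optional<String> get(String key) {
    return Optional.ofNullable(data.get(key));
  }

  @Override public void delete(String key) {
    data.remove(key);
  }
}

// Contract check shared by the fake and the real implementation.
// Any discrepancy between the two shows up as a failure of the same check.
final class StoreContract {
  static void check(KeyValueStore store) {
    store.put("k", "v1");
    require(store.get("k").equals(Optional.of("v1")), "get returns the last put value");
    store.put("k", "v2");
    require(store.get("k").equals(Optional.of("v2")), "put overwrites the previous value");
    store.delete("k");
    require(!store.get("k").isPresent(), "delete removes the key");
    require(!store.get("missing").isPresent(), "unknown keys are absent");
  }

  private static void require(boolean condition, String message) {
    if (!condition) {
      throw new AssertionError(message);
    }
  }
}
```

Calling StoreContract.check(new FakeKeyValueStore()) exercises the fake. Passing a client for the real service to the same method (not shown, since it depends on the service) would exercise the real implementation with identical expectations.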

This reverses the pain! API owners, who are in a position to make APIs better, are now the ones experiencing negative incentives when APIs are not easy to use. Previously, when clients felt the pain, they had no recourse other than to file easily-ignored bugs ("Closed: working as intended") or contribute changes to the API owners' codebase, hurting their own performance with distractions.

This will incentivize API owners to design APIs to be as simple as possible with as few side-effects as possible, and to provide high-quality fakes that make it easy for clients to integrate with the API. Some teams will certainly not like this change at first, but I have seen API teams come to the realization that this is the best choice for the larger effort and implement these practices despite their cost to the team in the short run.

Helping management set engineering team objectives may not seem like a typical SETI responsibility. However, although management is responsible for setting performance incentives and objectives, they are not well-positioned to understand how the low-level decisions of different teams create harmful interactions and lower cross-team performance. They need SETI and TE guidance to create an environment that encourages optimal behaviors.

Fast, small releases

Being forced to release more frequently than is required by feature deployment requirements has many beneficial side-effects that make release velocity a goal unto itself. SETIs and TEs faced with big, slow releases work with management to mandate a move to a set of smaller, more frequent releases. As release velocity is ratcheted up, negative behaviors such as too much manual testing or too much internal coupling become more painful, and many optimal behaviors are incentivized.

Less coupling between systems

When software is released together, it is easy to treat the seams between different components as implementation details. The resulting systems become so intertwined (coupled) that responsibilities between them are completely and randomly mixed and their interactions are too complex for any one person to understand. When two components are released separately and at different times, different versions of them must be compatible with one another. Engineers who were previously complacent about this fragility will become fearful of failed releases due to implicit contract changes. They will change their behavior in beneficial ways such as defining the contract between components explicitly and creating regression testing for it. The result is a system composed of robust, self-contained, more easily understood components.

Better/More automated testing

Manual testing becomes more painful as release velocity is ramped up. This will incentivize automated regression, UI and performance tests. This makes the team more agile and able to catch defects sooner and more cheaply.

Faster feedback

When incremental feature changes can be released to dogfood or other beta channels more frequently, user interaction designers and product managers get much faster feedback about what paths lead to better user engagement and experience than in big, slow releases where an entire feature is deployed simultaneously. This results in a better product.

Conclusion

The SETIs and TEs optimize interactions between teams and create fixes for product-wide, cross-team problems in order to improve engineering productivity and velocity. There are many worthwhile projects that EngProd can do using broad knowledge of the system and expertise in refactoring, automation and testing, such as creating test fixtures that enable continuous integration testing or identifying and combining duplicative tests or tools.

That said, the biggest problem that EngProd is positioned to solve is to break the chain of local optimizations resulting in cross-team de-optimizations. To that end, discomfort is a tool that can incentivize engineers to find solutions that are optimal for the entire product. We should look for and advocate for these transformative changes.

Testing on the Toilet: Keep Cause and Effect Clear

by Ben Yu

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.


Can you tell if this test is correct?
208: @Test public void testIncrement_existingKey() {
209: assertEquals(9, tally.get("key1"));
210: }

It’s impossible to know without seeing how the tally object is set up:
1:   private final Tally tally = new Tally();
2: @Before public void setUp() {
3: tally.increment("key1", 8);
4: tally.increment("key2", 100);
5: tally.increment("key1", 0);
6: tally.increment("key1", 1);
7: }
// 200 lines away
208: @Test public void testIncrement_existingKey() {
209: assertEquals(9, tally.get("key1"));
210: }

The problem is that the modification of key1's values occurs 200+ lines away from the assertion. Otherwise put, the cause is hidden far away from the effect.

Instead, write tests where the effects immediately follow the causes. It's how we speak in natural language: “If you drive over the speed limit (cause), you’ll get a traffic ticket (effect).” Once we group the two chunks of code, we easily see what’s going on:
private final Tally tally = new Tally();

@Test public void testIncrement_newKey() {
  tally.increment("key", 100);
  assertEquals(100, tally.get("key"));
}

@Test public void testIncrement_existingKey() {
  tally.increment("key", 8);
  tally.increment("key", 1);
  assertEquals(9, tally.get("key"));
}

@Test public void testIncrement_incrementByZeroDoesNothing() {
  tally.increment("key", 8);
  tally.increment("key", 0);
  assertEquals(8, tally.get("key"));
}

This style may require a bit more code. Each test sets its own input and verifies its own expected output. The payback is in more readable code and lower maintenance costs.

Happy 10th Birthday Google Testing Blog!

by Anthony Vallone

Ten years ago today, the first Google Testing Blog article was posted (official announcement 2 days later). Over the years, Google engineers have used this blog to help advance the test engineering discipline. We have shared information about our testing technologies, strategies, and theories; discussed what code quality really means; described how our teams are organized for optimal productivity; announced new tooling; and invited readers to speak at and attend the annual Google Test Automation Conference.

Google Testing Blog banner in 2007


The blog has enjoyed excellent readership. There have been over 10 million page views of the blog since it was created, and there are currently about 100 to 200 thousand views per month.

This blog is made possible by many Google engineers who have volunteered time to author and review content on a regular basis in the interest of sharing. Thank you to all the contributors and our readers!

Please leave a comment if you have a story to share about how this blog has helped you.

Announcing OSS-Fuzz: Continuous Fuzzing for Open Source Software

By Mike Aizatsky, Kostya Serebryany (Software Engineers, Dynamic Tools); Oliver Chang, Abhishek Arya (Security Engineers, Google Chrome); and Meredith Whittaker (Open Research Lead). 

We are happy to announce OSS-Fuzz, a new Beta program developed over the past years with the Core Infrastructure Initiative community. This program will provide continuous fuzzing for select core open source software.

Open source software is the backbone of the many apps, sites, services, and networked things that make up "the internet." It is important that the open source foundation be stable, secure, and reliable, as cracks and weaknesses impact all who build on it.

Recent security stories confirm that errors like buffer overflow and use-after-free can have serious, widespread consequences when they occur in critical open source software. These errors are not only serious, but notoriously difficult to find via routine code audits, even for experienced developers. That's where fuzz testing comes in. By generating random inputs to a given program, fuzzing triggers and helps uncover errors quickly and thoroughly.
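To make the idea concrete, here is a minimal sketch of naive random fuzzing in Java. It is not how AFL or libFuzzer work (those engines are coverage-guided and typically target C/C++ code), and everything in it is hypothetical: the class NaiveFuzzSketch, the sumPayload target, and its deliberate bug exist only for illustration. In Java the out-of-bounds read surfaces as an ArrayIndexOutOfBoundsException rather than as silent memory corruption.

```java
import java.util.Random;

public class NaiveFuzzSketch {
    // Hypothetical code under test: the first byte declares how many payload
    // bytes follow. Deliberate bug: the declared length is never checked
    // against the actual buffer size.
    static int sumPayload(byte[] data) {
        if (data.length == 0) {
            return 0;
        }
        int declaredLength = data[0] & 0xFF;
        int sum = 0;
        for (int i = 1; i <= declaredLength; i++) {
            sum += data[i]; // reads past the end when declaredLength >= data.length
        }
        return sum;
    }

    // Feeds random byte arrays to the target and returns the first input that
    // makes it throw, or null if no crash was seen within maxIterations.
    static byte[] fuzz(long seed, int maxIterations) {
        Random random = new Random(seed);
        for (int i = 0; i < maxIterations; i++) {
            byte[] input = new byte[random.nextInt(16)];
            random.nextBytes(input);
            try {
                sumPayload(input);
            } catch (RuntimeException e) {
                return input;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] crash = fuzz(42L, 10_000);
        System.out.println(crash == null
                ? "no crash found"
                : "found crashing input of length " + crash.length);
    }
}
```

Purely random inputs find this bug quickly because almost any length byte is wrong for a short buffer. Bugs hidden behind specific input structure are much harder to reach this way, which is one reason real fuzzing engines use feedback from the program under test.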

In recent years, several efficient general purpose fuzzing engines have been implemented (e.g. AFL and libFuzzer), and we use them to fuzz various components of the Chrome browser. These fuzzers, when combined with Sanitizers, can help find security vulnerabilities (e.g. buffer overflows, use-after-free, bad casts, integer overflows, etc.), stability bugs (e.g. null dereferences, memory leaks, out-of-memory, assertion failures, etc.) and sometimes even logical bugs.

OSS-Fuzz's goal is to make common software infrastructure more secure and stable by combining modern fuzzing techniques with scalable distributed execution. OSS-Fuzz combines various fuzzing engines (initially, libFuzzer) with Sanitizers (initially, AddressSanitizer) and provides a massive distributed execution environment powered by ClusterFuzz.

Early successes

Our initial trials with OSS-Fuzz have had good results. An example is the FreeType library, which is used on over a billion devices to display text (and which might even be rendering the characters you are reading now). It is important for FreeType to be stable and secure in an age when fonts are loaded over the Internet. Werner Lemberg, one of the FreeType developers, was an early adopter of OSS-Fuzz. Recently the FreeType fuzzer found a new heap buffer overflow only a few hours after the source change:

ERROR: AddressSanitizer: heap-buffer-overflow on address 0x615000000ffa 
READ of size 2 at 0x615000000ffa thread T0
SCARINESS: 24 (2-byte-read-heap-buffer-overflow-far-from-bounds)
#0 0x885e06 in tt_face_vary_cvt src/truetype/ttgxvar.c:1556:31

OSS-Fuzz automatically notified the maintainer, who fixed the bug; then OSS-Fuzz automatically confirmed the fix. All in one day! You can see the full list of fixed and disclosed bugs found by OSS-Fuzz so far.

Contributions and feedback are welcome

OSS-Fuzz has already found 150 bugs in several widely used open source projects (and churns ~4 trillion test cases a week). With your help, we can make fuzzing a standard part of open source development, and work with the broader community of developers and security testers to ensure that bugs in critical open source applications, libraries, and APIs are discovered and fixed. We believe that this approach to automated security testing will result in real improvements to the security and stability of open source software.

OSS-Fuzz is launching in Beta right now, and will be accepting suggestions for candidate open source projects. In order for a project to be accepted to OSS-Fuzz, it needs to have a large user base and/or be critical to Global IT infrastructure, a general heuristic that we are intentionally leaving open to interpretation at this early stage. See more details and instructions on how to apply here.

Once a project is signed up for OSS-Fuzz, it is automatically subject to the 90-day disclosure deadline for newly reported bugs in our tracker (see details here). This matches industry best practices and improves end-user security and stability by getting patches to users faster.

Help us ensure this program is truly serving the open source community and the internet which relies on this critical software: contribute and leave your feedback on GitHub.

What Test Engineers do at Google: Building Test Infrastructure

Author: Jochen Wuttke

In a recent post, we broadly talked about What Test Engineers do at Google. In this post, I talk about one aspect of the work TEs may do: building and improving test infrastructure to make engineers more productive.

Refurbishing legacy systems makes new tools necessary
A few years ago, I joined an engineering team that was working on replacing a legacy system with a new implementation. Because building the replacement would take several years, we had to keep the legacy system operational and even add features, while building the replacement so there would be no impact on our external users.

The legacy system was so complex and brittle that the engineers spent most of their time triaging and fixing bugs and flaky tests, but had little time to implement new features. The goal for the rewrite was to learn from the legacy system and to build something that was easier to maintain and extend. As the team's TE, my job was to understand what caused the high maintenance cost and how to improve on it. I found two main causes:
  • Tight coupling and insufficient abstraction made unit testing very hard, and as a consequence, a lot of end-to-end tests served as functional tests of that code.
  • The infrastructure used for the end-to-end tests had no good way to create and inject fakes or mocks for the external services the system depended on. As a result, the tests had to run a large number of servers for all these external dependencies. This led to very large and brittle tests that our existing test execution infrastructure was not able to handle reliably.
Exploring solutions
At first, I explored if I could split the large tests into smaller ones that would test specific functionality and depend on fewer external services. This proved impossible, because of the poorly structured legacy code. Making this approach work would have required refactoring the entire system and its dependencies, not just the parts my team owned.

In my second approach, I also focussed on large tests and tried to mock services that were not required for the functionality under test. This also proved very difficult, because dependencies changed often and individual dependencies were hard to trace in a graph of over 200 services. Ultimately, this approach just shifted the required effort from maintaining test code to maintaining test dependencies and mocks.

My third and final approach, illustrated in the figure below, made small tests more powerful. In the typical end-to-end test we faced, the client made RPC calls to several services, which in turn made RPC calls to other services. Together the client and the transitive closure over all backend services formed a large graph (not tree!) of dependencies, which all had to be up and running for the end-to-end test. The new model changes how we test client and service integration. Instead of running the client on inputs that will somehow trigger RPC calls, we write unit tests for the code making method calls to the RPC stub. The stub itself is mocked with a common mocking framework like Mockito in Java. For each such test, a second test verifies that the data used to drive that mock "makes sense" to the actual service. This is also done with a unit test, where a replay client uses the same data the RPC mock uses to call the RPC handler method of the service.


This pattern of integration testing applies to any RPC call, so the RPC calls made by a backend server to another backend can be tested just as well as front-end client calls. When we apply this approach consistently, we benefit from smaller tests that still test correct integration behavior, and make sure that the behavior we are testing is "real".
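The two-test pattern can be sketched roughly as follows. This is only an illustration under assumptions, not the framework described in this post: the QuotaStub interface, QuotaService handler, and RECORDED exchange are hypothetical names, and a hand-written lambda stands in for a Mockito mock so the example stays self-contained.

```java
import java.util.Collections;
import java.util.Map;

public class MockReplaySketch {
    // Hypothetical RPC interface shared by the client-side stub and the handler.
    interface QuotaStub {
        int remainingQuota(String userId);
    }

    // Client code under test: it reaches the service only through the stub.
    static String quotaBanner(QuotaStub stub, String userId) {
        int remaining = stub.remainingQuota(userId);
        return remaining > 0 ? remaining + " requests left" : "quota exhausted";
    }

    // Hypothetical service-side handler for the same RPC.
    static final class QuotaService implements QuotaStub {
        private final Map<String, Integer> used;
        private final int limit;

        QuotaService(Map<String, Integer> used, int limit) {
            this.used = used;
            this.limit = limit;
        }

        @Override
        public int remainingQuota(String userId) {
            return Math.max(0, limit - used.getOrDefault(userId, 0));
        }
    }

    // One recorded request/response pair. Both tests are driven by this data.
    static final class Exchange {
        final String request;
        final int response;

        Exchange(String request, int response) {
            this.request = request;
            this.response = response;
        }
    }

    static final Exchange RECORDED = new Exchange("alice", 3);

    // Test 1: unit test of the client against a mock stub fed from the record.
    public static boolean clientTestPasses() {
        QuotaStub mock = userId -> {
            if (!userId.equals(RECORDED.request)) {
                throw new IllegalArgumentException("unexpected request: " + userId);
            }
            return RECORDED.response;
        };
        return quotaBanner(mock, "alice").equals("3 requests left");
    }

    // Test 2: replay the same record against the handler, checking that the
    // data driving the mock "makes sense" to the actual service.
    public static boolean replayTestPasses() {
        QuotaService service = new QuotaService(Collections.singletonMap("alice", 7), 10);
        return service.remainingQuota(RECORDED.request) == RECORDED.response;
    }

    public static void main(String[] args) {
        System.out.println("client test passes: " + clientTestPasses());
        System.out.println("replay test passes: " + replayTestPasses());
    }
}
```

Because both tests read the same recorded exchange, a change in the service's behavior fails the replay test and flags the mock data as stale, without starting any server.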

To arrive at this solution, I had to build, evaluate, and discard several prototypes. While it took a day to build a proof-of-concept for this approach, it took me and another engineer a year to implement a finished tool developers could use.

Adoption
The engineers embraced the new solution very quickly when they saw that the new framework removes large amounts of boilerplate code from their tests. To further drive its adoption, I organized multi-day events with the engineering team where we focussed on migrating test cases. It took a few months to migrate all existing unit tests to the new framework, close gaps in coverage, and create the new tests that validate the mocks. Once we converted about 80% of the tests, we started comparing the efficacy of the new tests and the existing end-to-end tests.

The results are very good:
  • The new tests are as effective in finding bugs as the end-to-end tests are.
  • The new tests run in about 3 minutes instead of 30 minutes for the end-to-end tests.
  • The client side tests are 0% flaky. The verification tests are usually less flaky than the end-to-end tests, and never more.
Additionally, the new tests are unit tests, so you can run them in your IDE and step through them to debug. These results allowed us to run the end-to-end tests very rarely, only to detect misconfigurations of the interacting services, but not as functional tests.

Building and improving test infrastructure to help engineers be more productive is one of the many things test engineers do at Google. Running this project from requirements gathering all the way to a finished product gave me the opportunity to design and implement several prototypes, drive the full implementation of one solution, lead engineering teams to adoption of the new framework, and integrate feedback from engineers and actual measurements into the continuous refinement of the tool.