Category Archives: Google Testing Blog

If it ain’t broke, you’re not trying hard enough

Code Health: To Comment or Not to Comment?

This is another post in our Code Health series. A version of this post originally appeared in Google bathrooms worldwide as a Google Testing on the Toilet episode. You can download a printer-friendly version to display in your office.

By Dori Reuveni and Kevin Bourrillion

While reading code, often there is nothing more helpful than a well-placed comment. However, comments are not always good. Sometimes the need for a comment can be a sign that the code should be refactored.

Use a comment when it is infeasible to make your code self-explanatory. If you think you need a comment to explain what a piece of code does, first try one of the following:
  • Introduce an explaining variable.
    // Subtract discount from price.
    finalPrice = (numItems * itemPrice)
    - min(5, numItems) * itemPrice * 0.1;
    price = numItems * itemPrice;
    discount =
    min(5, numItems) * itemPrice * 0.1;
    finalPrice = price - discount;
  • Extract a method.
    // Filter offensive words.
    for (String word : words) { ... }
    filterOffensiveWords(words);
  • Use a more descriptive identifier name.
    int width = ...; // Width in pixels.
    int widthInPixels = ...;
  • Add a check in case your code has assumptions.
    // Safe since height is always > 0.
    return width / height;
    checkArgument(height > 0);
    return width / height;
There are cases where a comment can be helpful:
  • Reveal your intent: explain why the code does something (as opposed to what it does).
    // Compute once because it’s expensive.
  • Protect a well-meaning future editor from mistakenly “fixing” your code.
    // Create a new Foo instance because Foo is not thread-safe.
  • Clarification: a question that came up during code review or that readers of the code might have.
    // Note that order matters because...
  • Explain your rationale for what looks like a bad software engineering practice.
    @SuppressWarnings("unchecked") // The cast is safe because...
On the other hand, avoid comments that just repeat what the code does. These are just noise:
// Get all users.
userService.getAllUsers();
// Check if the name is empty.
if (name.isEmpty()) { ... }

Evolution of GTAC and Engineering Productivity

When Google first hosted GTAC in 2006, we didn’t know what to expect. We kicked off this conference with the intention to share our innovation in test automation, learn from others in the industry and connect with academia. Over the last decade we’ve had great participation and had the privilege to host GTAC in North America, Europe and Asia -- largely thanks to the many of you who spoke, participated and connected!

In the recent months, we’ve been taking a hard look at the discipline of Engineering Productivity as a logical next step in the evolution of test automation. In that same vein, we’re going to rethink what an Engineering Productivity focused conference should look like today.  As we pivot, we will be extending these changes to GTAC and because we expect changes in theme, content and format, we are canceling the upcoming event scheduled in London this November. We’ll be bringing the event back in 2018 with a fresh outlook and strategy.

While we know this may be disappointing for many of the folks who were looking forward to GTAC, we’re excited to come back with a new format which will serve this conference well in today’s environment.

Code Health: Too Many Comments on Your Code Reviews?

This is another post in our Code Health series. A version of this post originally appeared in Google bathrooms worldwide as a Google Testing on the Toilet episode. You can download a printer-friendly version to display in your office.

By Tom O'Neill


Code reviews can slow down an individual code change, but they’re also an opportunity to improve your code and learn from another intelligent, experienced engineer. How can you get the most out of them?
Aim to get most of your changes approved in the first round of review, with only minor comments. If your code reviews frequently require multiple rounds of comments, these tips can save you time.

Spend your reviewers’ time wisely—it’s a limited resource. If they’re catching issues that you could easily have caught yourself, you’re lowering the overall productivity of your team.
Before you send out the code review:
  • Re-evaluate your code: Don’t just send the review out as soon as the tests pass. Step back and try to rethink the whole thing—can the design be cleaned up? Especially if it’s late in the day, see if a better approach occurs to you the next morning. Although this step might slow down an individual code change, it will result long-term in greater average throughput.
  • Consider an informal design discussion: If there’s something you’re not sure about, pair program, talk face-to-face, or send an early diff and ask for a “pre-review” of the overall design.
  • Self-review the change: Try to look at the code as critically as possible from the standpoint of someone who doesn’t know anything about it. Your code review tool can give you a radically different view of your code than the IDE. This can easily save you a round trip.
  • Make the diff easy to understand: Multiple changes at once make the code harder to review. When you self-review, look for simple changes that reduce the size of the diff. For example, save significant refactoring or formatting changes for another code review.
  • Don’t hide important info in the submit message: Put it in the code as well. Someone reading the code later is unlikely to look at the submit message.
When you’re addressing code review comments:
  • Re-evaluate your code after addressing non-trivial comments: Take a step back and really look at the code with fresh eyes. Once you’ve made one set of changes, you can often find additional improvements that are enabled or suggested by those changes. Just as with any refactoring, it may take several steps to reach the best design.
  • Understand why the reviewer made each comment: If you don’t understand the reasoning behind a comment, don’t just make the change—seek out the reviewer and learn something new.
  • Answer the reviewer’s questions in the code: Don’t just reply—make the code easier to understand (e.g., improve a variable name, change a boolean to an enum) or add a comment. Someone else is going to have the same question later on.

Code Health: Reduce Nesting, Reduce Complexity

This is another post in our Code Health series. A version of this post originally appeared in Google bathrooms worldwide as a Google Testing on the Toilet episode. You can download a printer-friendly version to display in your office.

By Elliott Karpilovsky

Deeply nested code hurts readability and is error-prone. Try spotting the bug in the two versions of this code:

Code with too much nesting Code with less nesting
response = server.Call(request)

if response.GetStatus() == RPC.OK:
if response.GetAuthorizedUser():
if response.GetEnc() == 'utf-8':
if response.GetRows():
vals = [ParseRow(r) for r in
response.GetRows()]
avg = sum(vals) / len(vals)
return avg, vals
else:
raise EmptyError()
else:
raise AuthError('unauthorized')
else:
raise ValueError('wrong encoding')
else:
raise RpcError(response.GetStatus())
response = server.Call(request)

if response.GetStatus() != RPC.OK:
raise RpcError(response.GetStatus())

if not response.GetAuthorizedUser():
raise ValueError('wrong encoding')

if response.GetEnc() != 'utf-8':
raise AuthError('unauthorized')

if not response.GetRows():
raise EmptyError()

vals = [ParseRow(r) for r in
response.GetRows()]
avg = sum(vals) / len(vals)
return avg, vals


Answer: the "wrong encoding" and "unauthorized" errors are swapped. This bug is easier to see in the refactored version, since the checks occur right as the errors are handled.

The refactoring technique shown above is known as guard clauses. A guard clause checks a criterion and fails fast if it is not met. It decouples the computational logic from the error logic. By removing the cognitive gap between error checking and handling, it frees up mental processing power. As a result, the refactored version is much easier to read and maintain.

Here are some rules of thumb for reducing nesting in your code:
  • Keep conditional blocks short. It increases readability by keeping things local.
  • Consider refactoring when your loops and branches are more than 2 levels deep.
  • Think about moving nested logic into separate functions. For example, if you need to loop through a list of objects that each contain a list (such as a protocol buffer with repeated fields), you can define a function to process each object instead of using a double nested loop.
Reducing nesting results in more readable code, which leads to discoverable bugs, faster developer iteration, and increased stability. When you can, simplify!

GTAC Diversity Scholarship



by Lesley Katzen on behalf of the GTAC Diversity Committee


We are committed to increasing diversity at GTAC, and we believe the best way to do that is by making sure we have a diverse set of applicants to speak and attend. As part of that commitment, we are excited to announce that we will be offering travel scholarships again this year.
Travel scholarships will be available for selected applicants from traditionally underrepresented groups in technology.

To be eligible for a grant to attend GTAC, applicants must:
  • Be 18 years of age or older.
  • Be from a traditionally underrepresented group in technology.
  • Work or study in Computer Science, Computer Engineering, Information Technology, or a technical field related to software testing.
  • Be able to attend core dates of GTAC, November 14th - 15th 2017 in London, England.
To apply:
You must fill out the following scholarship formand register for GTAC to be considered for a travel scholarship.
The deadline for submission is July 1st. Scholarship recipients will be announced on August 15th. If you are selected, we will contact you with information on how to proceed with booking travel.

What the scholarship covers:
Google will pay for round-trip standard coach class airfare to London for selected scholarship recipients, and 3 nights of accommodations in a hotel near the Google King's Cross campus. Breakfast and lunch will be provided for GTAC attendees and speakers on both days of the conference. We will also provide a £75.00 gift card for other incidentals such as airport transportation or meals. You will need to provide your own credit card to cover any hotel incidentals.

Google is dedicated to providing a harassment-free and inclusive conference experience for everyone. Our anti-harassment policy can be found at:
https://www.google.com/events/policy/anti-harassmentpolicy.html

GTAC 2017 – Registration is open!

by Diego Cavalcanti on behalf of the GTAC 2017 Committee
The Google Test Automation Conference (GTAC) is an annual test automation conference hosted by Google. It brings together engineers from industry and academia to discuss advances in test automation and the test engineering computer science field. It is a great opportunity to present, learn, and challenge modern testing technologies and strategies.

We are pleased to announce that this year, GTAC will be held in Google's London office on November 14th and 15th, 2017.

Registration is currently OPEN for attendees and speakers. See more information here.

The schedule for the upcoming months is as follows:
  • May 15, 2017 - Registration opens for speakers and attendees, including applicants for the diversity scholarship.
  • July 1, 2017 - Registration closes for speaker submissions.
  • July 15, 2017 - Registration closes for attendee submissions.
  • August 15, 2017 - Selected speakers and attendees will be notified.
  • November 13, 2017 - Rehearsal day for speakers (not open for attendees).
  • November 14-15, 2017 - GTAC 2017!
As part of our efforts to increase diversity of speakers and attendees at GTAC, we will again be offering travel scholarships for selected applicants from traditionally underrepresented groups in technology. Please find more information here.

Please do not hesitate to contact gtac2017@google.com if you have any questions. We look forward to seeing you in London!

OSS-Fuzz: Five Months Later, and Rewarding Projects

By Oliver Chang, Abhishek Arya (Security Engineers, Chrome Security), Kostya Serebryany (Software Engineer, Dynamic Tools), and Josh Armour (Security Program Manager)

Five months ago, we announcedOSS-Fuzz, Google's effort to help make open source software more secure and stable. Since then, our robot army has been working hard at fuzzing, processing 10 trillion test inputs a day. Thanks to the efforts of the open source community who have integrated a total of 47 projects, we've found over 1,000bugs (264of which are potential security vulnerabilities).
Breakdown of the types of bugs we're finding

Notable results

OSS-Fuzz has found numerous security vulnerabilities in several critical open source projects: 10in FreeType2, 17in FFmpeg, 33in LibreOffice, 8in SQLite 3, 10in GnuTLS, 25in PCRE2, 9in gRPC, and 7in Wireshark. We've also had at least one bug collision with another independent security researcher (CVE-2017-2801). (Some of the bugs are still view-restricted so links may show smaller numbers.)

Once a project is integrated into OSS-Fuzz, the continuous and automated nature of OSS-Fuzz means that we often catch these issues just hours after the regression is introduced into the upstream repository, so that the chances of users being affected is reduced.

Fuzzing not only finds memory safety related bugs, it can also find correctness or logic bugs. One example is a carry propagating bug in OpenSSL (CVE-2017-3732).

Finally, OSS-Fuzz has reported over 300 timeout and out-of-memory failures (~75% of which got fixed). Not every project treats these as bugs, but fixing them enables OSS-Fuzz to find more interesting bugs.

Announcing rewards for open source projects

We believe that user and internet security as a whole can benefit greatly if more open source projects include fuzzing in their development process. To this end, we'd like to encourage more projects to participate and adopt the ideal integration guidelines that we've established.

Combined with fixing all the issues that are found, this is often a significant amount of work for developers who may be working on an open source project in their spare time. To support these projects, we are expanding our existing Patch Rewardsprogram to include rewards for the integration of fuzz targets into OSS-Fuzz.

To qualify for these rewards, a project needs to have a large user base and/or be critical to global IT infrastructure. Eligible projects will receive $1,000 for initial integration, and up to $20,000 for ideal integration (the final amount is at our discretion). You have the option of donating these rewards to charity instead, and Google will double the amount.

To qualify for the ideal integration reward, projects must show that:
  • Fuzz targets are checked into their upstream repository and integrated in the build system with sanitizer support (up to $5,000).
  • Fuzz targets are efficientand provide good code coverage (>80%) (up to $5,000).
  • Fuzz targets are part of the official upstream development and regression testing process, i.e. they are maintained, run against old known crashers and the periodically updated corpora(up to $5,000).
  • The last $5,000 is a "l33t" bonus that we may reward at our discretion for projects that we feel have gone the extra mile or done something really awesome.
We've already started to contact the first round of projects that are eligible for the initial reward. If you are the maintainer or point of contact for one of these projects, you may also reach out to us in order to apply for our ideal integration rewards.

The future

We'd like to thank the existing contributors who integrated their projects and fixed countless bugs. We hope to see more projects integrated into OSS-Fuzz, and greater adoption of fuzzing as standard practice when developing software.

Where do our flaky tests come from?

author: Jeff Listfield

When tests fail on code that was previously tested, this is a strong signal that something is newly wrong with the code. Before, the tests passed and the code was correct; now the tests fail and the code is not working right. The goal of a good test suite is to make this signal as clear and directed as possible.

Flaky (nondeterministic) tests, however, are different. Flaky tests are tests that exhibit both a passing and a failing result with the same code. Given this, a test failure may or may not mean that there's a new problem. And trying to recreate the failure, by rerunning the test with the same version of code, may or may not result in a passing test. We start viewing these tests as unreliable and eventually they lose their value. If the root cause is nondeterminism in the production code, ignoring the test means ignoring a production bug.
Flaky Tests at Google

Google has around 4.2 million tests that run on our continuous integration system. Of these, around 63 thousand have a flaky run over the course of a week. While this represents less than 2% of our tests, it still causes significant drag on our engineers.
If we want to fix our flaky tests (and avoid writing new ones) we need to understand them. At Google, we collect lots of data on our tests: execution times, test types, run flags, and consumed resources. I've studied how some of this data correlates with flaky tests and believe this research can lead us to better, more stable testing practices. Overwhelmingly, the larger the test (as measured by binary size, RAM use, or number of libraries built), the more likely it is to be flaky. The rest of this post will discuss some of my findings.
For a previous discussion of our flaky tests, see John Micco's postfrom May 2016.
Test size - Large tests are more likely to be flaky

We categorize our tests into three general sizes: small, medium and large. Every test has a size, but the choice of label is subjective. The engineer chooses the size when they initially write the test, and the size is not always updated as the test changes. For some tests it doesn't reflect the nature of the test anymore. Nonetheless, it has some predictive value. Over the course of a week, 0.5% of our small tests were flaky, 1.6% of our medium tests were flaky, and 14% of our large tests were flaky [1]. There's a clear increase in flakiness from small to medium and from medium to large. But this still leaves open a lot of questions. There's only so much we can learn looking at three sizes.
The larger the test, the more likely it will be flaky

There are some objective measures of size we collect: test binary size and RAM used when running the test [2]. For these two metrics, I grouped tests into equal-sized buckets [3] and calculated the percentage of tests in each bucket that were flaky. The numbers below are the r2 values of the linear best fit [4].

Correlation between metric and likelihood of test being flaky
Metric r2
Binary size 0.82
RAM used 0.76


The tests that I'm looking at are (for the most part) hermetic tests that provide a pass/fail signal. Binary size and RAM use correlated quite well when looking across our tests and there's not much difference between them. So it's not just that large tests are likely to be flaky, it's that the larger the tests get, the more likely they are to be flaky.

I have charted the full set of tests below for those two metrics. Flakiness increases with increases in binary size [5], but we also see increasing linear fit residuals [6] at larger sizes.


The RAM use chart below has a clearer progression and only starts showing large residuals between the first and second vertical lines.



While the bucket sizes are constant, the number of tests in each bucket is different. The points on the right with larger residuals include much fewer tests than those on the left. If I take the smallest 96% of our tests (which ends just past the first vertical line) and then shrink the bucket size, I get a much stronger correlation (r2 is 0.94). It perhaps indicates that RAM and binary size are much better predictors than the overall charts show.



Certain tools correlate with a higher rate of flaky tests
Some tools get blamed for being the cause of flaky tests. For example, WebDriver tests (whether written in Java, Python, or JavaScript) have a reputation for being flaky [7]. For a few of our common testing tools, I determined the percentage of all the tests written with that tool that were flaky. Of note, all of these tools tend to be used with our larger tests. This is not an exhaustive list of all our testing tools, and represents around a third of our overall tests. The remainder of the tests use less common tools or have no readily identifiable tool.

Flakiness of tests using some of our common testing tools
Category % of tests that are flaky % of all flaky tests
All tests
1.65%
100%
Java WebDriver
10.45%
20.3%
Python WebDriver
18.72%
4.0%
An internal integration tool
14.94%
10.6%
Android emulator
25.46%
11.9%


All of these tools have higher than average flakiness. And given that 1 in 5 of our flaky tests are Java WebDriver tests, I can understand why people complain about them. But correlation is not causation, and given our results from the previous section, there might be something other than the tool causing the increased rate of flakiness.
Size is more predictive than tool

We can combine tool choice and test size to see which is more important. For each tool above, I isolated tests that use the tool and bucketed those based on memory usage (RAM) and binary size, similar to my previous approach. I calculated the line of best fit and how well it correlated with the data (r2). I then computed the predicted likelihood a test would be flaky at the smallest bucket [8] (which is already the 48th percentile of all our tests) as well as the 90th and 95th percentile of RAM used.
Predicted flaky likelihood by RAM and tool
Category r2 Smallest bucket
(48th percentile)
90th percentile 95th percentile
All tests 0.76 1.5% 5.3% 9.2%
Java WebDriver 0.70 2.6% 6.8% 11%
Python WebDriver 0.65 -2.0% 2.4% 6.8%
An internal integration tool 0.80 -1.9% 3.1% 8.1%
Android emulator 0.45 7.1% 12% 17%


This table shows the results of these calculations for RAM. The correlation is stronger for the tools other than Android emulator. If we ignore that tool, the difference in correlations between tools for similar RAM use are around 4-5%. The differences from the smallest test to the 95th percentile for the tests are 8-10%. This is one of the most useful outcomes from this research: tools have some impact, but RAM use accounts for larger deviations in flakiness.
Predicted flaky likelihood by binary sizeand tool

Category r2 Smallest bucket
(33rd percentile)
90th percentile 95th percentile
All tests 0.82 -4.4% 4.5% 9.0%
Java WebDriver 0.81 -0.7% 14% 21%
Python WebDriver 0.61 -0.9% 11% 17%
An internal integration tool 0.80 -1.8% 10% 17%
Android emulator 0.05 18% 23% 25%


There's virtually no correlation between binary size and flakiness for Android emulator tests. For the other tools, you see greater variation in predicted flakiness between the small tests and large tests compared to RAM; up to 12% points. But you also see wider differences from the smallest size to the largest; 22% at the max. This is similar to what we saw with RAM use and another of the most useful outcomes of this research: binary size accounts for larger deviations in flakiness than the tool you use.
Conclusions

Engineer-selected test size correlates with flakiness, but within Google there are not enough test size options to be particularly useful.
Objectively measured test binary size and RAM have strong correlations with whether a test is flaky. This is a continuous function rather than a step function. A step function would have sudden jumps and could indicate that we're transitioning from one type of test to another at those points (e.g. unit tests to system tests or system tests to integration tests).
Tests written with certain tools exhibit a higher rate of flakiness. But much of that can be explained by the generally larger size of these tests. The tool itself seems to contribute only a small amount to this difference.
We need to be more careful before we decide to write large tests. Think about what code you are testing and what a minimal test would look like. And we need to be careful as we write large tests. Without additional effort aimed at preventing flakiness, there's is a strong likelihood you will have flaky tests that require maintenance.
Footnotes
  1. A test was flaky if it had at least one flaky run during the week.
  2. I also considered number of libraries built to create the test. In a 1% sample of tests, binary size (0.39) and RAM use (0.34) had stronger correlations than number of libraries (0.27). I only studied binary size and RAM use moving forward.
  3. I aimed for around 100 buckets for each metric.
  4. r2 measures how closely the line of best fit matches the data. A value of 1 means the line matches the data exactly.
  5. There are two interesting areas where the points actually reverse their upward slope. The first starts about halfway to the first vertical line and lasts for a few data points and the second goes from right before the first vertical line to right after. The sample size is large enough here that it's unlikely to just be random noise. There are clumps of tests around these points that are more or less flaky than I'd expect only considering binary size. This is an opportunity for further study.
  6. Distance from the observed point and the line of best fit.
  7. Other web testing tools get blamed as well, but WebDriver is our most commonly used one.
  8. Some of the predicted flakiness percents for the smallest buckets end up being negative. While we can't have a negative percent of tests be flaky, it is a possible outcome using this type of prediction.

Code Health: Google’s Internal Code Quality Efforts

By Max Kanat-Alexander, Tech Lead for Code Health and Author of Code Simplicity

There are many aspects of good coding practices that don't fall under the normal areas of testing and tooling that most Engineering Productivity groups focus on in the software industry. For example, having readable and maintainable code is about more than just writing good tests or having the right tools—it's about having code that can be easily understood and modified in the first place. But how do you make sure that engineers follow these practices while still allowing them the independence that they need to make sound engineering decisions?

Many years ago, a group of Googlers came together to work on this problem, and they called themselves the "Code Health" group. Why "Code Health"? Well, many of the other terms used for this in the industry—engineering productivity, best practices, coding standards, code quality—have connotations that could lead somebody to think we were working on something other than what we wanted to focus on. What we cared about was the processes and practices of software engineering in full—any aspect of how software was written that could influence the readability, maintainability, stability, or simplicity of code. We liked the analogy of having "healthy" code as covering all of these areas.

This is a field that many authors, theorists, and conference speakers touch on, but not an area that usually has dedicated resources within engineering organizations. Instead, in most software companies, these efforts are pushed by a few dedicated engineers in their extra time or led by the senior tech leads. However, every software engineer is actually involved in code health in some way. After all, we all write software, and most of us care deeply about doing it the "right way." So why not start a group that helps engineers with that "right way" of doing things?

This isn't to say that we are prescriptive about engineering practices at Google. We still let engineers make the decisions that are most sensible for their projects. What the Code Health group does is work on efforts that universally improve the lives of engineers and their ability to write products with shorter iteration time, decreased development effort, greater stability, and improved performance. Everybody appreciates their code getting easier to understand, their libraries getting simpler, etc. because we all know those things let us move faster and make better products.

But how do we accomplish all of this? Well, at Google, Code Health efforts come in many forms.

There is a Google-wide Code Health Group composed of 20%contributors who work to make engineering at Google better for everyone. The members of this group maintain internal documents on best practices and act as a sounding board for teams and individuals who wonder how best to improve practices in their area. Once in a while, for critical projects, members of the group get directly involved in refactoring code, improving libraries, or making changes to tools that promote code health.

For example, this central group maintains Google's code review guidelines, writes internal publications about best practices, organizes tech talks on productivity improvements, and generally fosters a culture of great software engineering at Google.

Some of the senior members of the Code Health group also advise engineering executives and internal leadership groups on how to improve engineering practices in their areas. It's not always clear how to implement effective code health practices in an area—some people have more experience than others making this happen broadly in teams, and so we offer our consulting and experience to help make simple code and great developer experiences a reality.

In addition to the central group, many products and teams at Google have their own Code Health group. These groups tend to work more closely on actual coding projects, such as addressing technical debt through refactoring, making tools that detect and prevent bad coding practices, creating automated code formatters, or making systems for automatically deleting unused code. Usually these groups coordinate and meet with the central Code Health group to make sure that we aren't duplicating efforts across the company and so that great new tools and systems can be shared with the rest of Google.

Throughout the years, Google's Code Health teams have had a major impact on the ability of engineers to develop great products quickly at Google. But code complexity isn't an issue that only affects Google—it affects everybody who writes software, from one person writing software on their own time to the largest engineering teams in the world. So in order to help out everybody, we're planning to release articles in the coming weeks and months that detail specific practices that we encourage internally—practices that can be applied everywhere to help your company, your codebase, your team, and you. Stay tuned here on the Google Testing Blog for more Code Health articles coming soon!

Discomfort as a Tool for Change

by Dave Gladfelter (SETI, Google Drive)

Introduction

The SETI (Software Engineer, Tools and Infrastructure) role at Google is a strange one in that there's no obvious reason why it should exist. The SWEs (Software Engineers) on a project understand its problems best, and understanding a problem is most of the way to fixing it. How can SETIs bring unique value to a project when SWEs have more on-the-ground experience with their impediments?

The answer is scope. A SWE is rewarded for being an expert in their particular area and domain and is highly motivated to make optimizations to their carved-out space. SETIs (and Test Engineers and EngProdin general) identify and solve product-wide problems.

Product-wide problems frequently arise because local optimizations don't necessarily add up to product-wide optimizations. The reason may be the limits of attention, blind spots, or mis-aligned incentives, but a group of SWEs each optimizing for their own sub-projects will not achieve product-wide maxima.

Often SETIs and Test Engineers (TEs) know what behavior they'd like to see, such as more integration tests. We may even have management's ear and convince them to mandate such tests. However, in the absence of incentives, it's unlikely that the decisions SWEs make in response to such mandates will add up to the behavior we desire. Mandates around methods/practices are often ineffective. For example, a mandate of documentation for each public method on an interface often results in "method foo does foo."

The best way to create product-wide efficiencies is to change the way the team or process works in ways that will (initially) be uncomfortable for the engineering team, but that pays dividends that can't be achieved any other way. SETIs and TEs must work to identify the blind spots and negative interactions between engineering teams and change the environment in ways that align engineering teams' incentives. When properly incentivized, SWEs will make optimal decisions enhanced by product-wide vision rather than micro-management.

Common Product-Wide Problems

Hard-to-use APIs

One common example of local optimizations resulting in cross-team de-optimization is documentation and ease-of-use of internal APIs. The team that implements an internal API is not rewarded for making it easy to use except in the most oblique ways. Clients are compelled to use the internal APIs provided to them, so the API owner has a monopoly and will set the price of using it at "you must read all the code and debug it yourself" in the absence of incentives or (rare) heroes.

Big, slow releases

Another example is large and slow releases. Without EngProd help or external pressure, teams will gravitate to the slowest, biggest release possible.

This makes sense from the position of any individual SWE: releases are painful, you have to ensure that there are no UI and API regressions, watch traffic and error rates for some time, and re-learn and use tools and processes that are complex and specific to releases.

Multiple teams will naturally gravitate to having one big release so that all of these costs can be bundled into one operation for "efficiency." The result is that engineers don't get feedback on features for weeks and versioning of APIs and data stores is ignored (since all the parts of the system are bundled together into one big release). This greatly slows down developer and feature velocity and greatly increases risks of cascading failures when the release fails.

How EngProd fixes product-wide problems

SETIs can nibble around the edges of these kinds of problems by writing tools and automation. TEs can create easy-to-use test environments that facilitate isolating and debugging faults in integration and ambiguities in APIs. We can use fancy technologies to sample live traffic and ensure that new versions of systems behave the same as previous versions. We can review design docs to ensure that they have an appropriate test plan. Often these actions do have real value. However, these are not the best way to align incentives to create a product-wide solution. Facilitating engineering teams' fruitful collaboration (and dis-incentivizing negative interactions) gives EngProd a multiplier that is hard to achieve with only tooling and automation.

Heroes are few and far between so we must turn to incentives, which is where discomfort comes in. Continuity is comfortable and change is painful. EngProd looks at how to change the problem so that teams are incentivized to work together fruitfully and disincentivized (discomforted) to pursue local optimizations exclusively.

So how does EngProd align incentives? Certainly there is a place for optimizing for optimal behaviors, such as easy-to-use integration environments. However, incentivizing optimal behaviors via negative feedback should not be overlooked. Each problem is different, so let's look at how to address the two examples above:

Incentivizing easy-to-use APIs

Engineers will make the things they're incentivized to make. For APIs, make teams incentivized to provide integration help in the form of fakes. EngProd works with team leads to ensure there are explicit objectives to provide Fakes for their APIs as part of the rollout.

Fakesare as-simple-as-possible implementations of a service that still can be used to do pre-submit testing of client interactions with the system. They don't replace integration tests, but they reduce the likelihood of finding errors in subsequent integration test runs by an order of magnitude.
Furthermore, have some subset of the same client-owned and server-owned tests run against the fakes (for quick presubmit testing) as well as the real implementation (for continuous integration testing) and work with management to make it the responsibility of the Fake owner to debug any discrepancies for either the client- or the server-owned tests.

This reverses the pain! API owners, who are in a position to make APIs better, are now the ones experiencing negative incentives when APIs are not easy to use. Previously, when clients felt the pain, they had no recourse other than to file easily-ignored bugs ("Closed: working as intended") or contribute changes to the API owners' codebase, hurting their own performance with distractions.

This will incentivize API owners to design APIs to be as simple as possible with as few side-effects as possible, and to provide high-quality fakes that make it easy for clients to integrate with the API. Some teams will certainly not like this change at first, but I have seen API teams come to the realization that this is the best choice for the larger effort and implement these practices despite their cost to the team in the short run.

Helping management set engineering team objectives may not seem like a typical SETI responsibility, but although management is responsible for setting performance incentives and objectives, they are not well-positioned to understand how the low-level decisions of different teams create harmful interactions and lower cross-team performance, so they need SETI and TE guidance to create an environment that encourages optimal behaviors.

Fast, small releases

Being forced to release more frequently than is required by feature deployment requirements has many beneficial side-effects that make release velocity a goal unto itself. SETIs and TEs faced with big, slow releases work with management to mandate a move to a set of smaller, more frequent releases. As release velocity is ratcheted up, negative behaviours such as too much manual testing or too much internal coupling become more painful, and many optimal behaviors are incentivized.

Less coupling between systems

When software is released together, it is easy to treat the seams between different components as implementation details. Resulting systems becoming so intertwined (coupled) that responsibilities between them are completely and randomly mixed and their interactions are too complex for any one person to understand. When two components are released separately and at different times, different versions of them must be compatible with one another. Engineers who were previously complacent about this fragility will become fearful of failed releases due to implicit contract changes. They will change their behavior in beneficial ways such as defining the contract between components explicitly and creating regression testing for it. The result is a system composed of robust, self-contained, more easily understood components.

Better/More automated testing

Manual testing becomes more painful as release velocity is ramped up. This will incentivize automated regression, UI and performance tests. This makes the team more agile and able to catch defects sooner and more cheaply.

Faster feedback

When incremental feature changes can be released to dogfood or other beta channels more frequently, user interaction designers and product managers get much faster feedback about what paths lead to better user engagement and experience than in big, slow releases where an entire feature is deployed simultaneously. This results in a better product.

Conclusion

The SETIs and TEs optimize interactions between teams and create fixes for product-wide, cross-team problems in order to improve engineering productivity and velocity. There are many worthwhile projects that EngProd can do using broad knowledge of the system and expertise in refactoring, automation and testing, such as creating test fixtures that enable continuous integration testing or identifying and combining duplicative tests or tools.

That said, the biggest problem that EngProd is positioned to solve is to break the chain of local optimizations resulting in cross-team de-optimizations. To that end, discomfort is a tool that can incentivize engineers to find solutions that are optimal for the entire product. We should look for and advocate for these transformative changes.