Tag Archives: CRE

CRE life lessons: The practicalities of dark launching



In the first part of this series, we introduced you to the concept of dark launches. In a dark launch, you take a copy of your incoming traffic and send it to the new service, then throw away the result. Dark launches are useful when you want to launch a new version of an existing service, but don’t want nasty surprises when you turn it on.

This isn’t always straightforward as it sounds, however. In this blog post, we’ll look at some of the circumstances that can make things difficult for you, and teach you how to work around them.

Finding a traffic source

Do you actually have existing traffic for your service? If you’re launching a new web service which is not more-or-less-directly replacing an existing service, you may not.

As an example, say you’re an online catalog company that lets users browse items from your physical store’s inventory. The system is working well, but now you want to give users the ability to purchase one of those items. How would you do a dark launch of this feature? How can you approximate real usage when no user is even seeing the option to purchase an item?

One approach is to fire off a dark-launch query to your new component for every user query to the original component. In our example, we might send a background “purchase” request for an item whenever the user sends a “view” request for that item. Realistically, not every user who views an item will go on to purchase it, so we might randomize the dark launch by only sending a “purchase” request for one in every five views.

This will hopefully give you an approximation of live traffic in terms of volume and pattern. Note that this can’t be expected to be totally accurate when it comes to to live traffic when the service is launched. But, it’s better than nothing.

Dark launching mutating services

Generally, a read-only service is fairly easy to dark-launch. A service with queries that mutate backend storage is far less easy. There are still strong reasons for doing the dark launch in this situation, because it gives you some degree of testing that you can’t reasonably get elsewhere, but you’ll need to invest significant effort to get the most from dark-launching.

Unless you’re doing a storage migration, you’ll need to make significant effort/payoff tradeoffs doing dark launches for mutating queries. The easiest option is to disable the mutates for the dark-launch traffic, returning a dummy response after the mutate is prepared but before it’s sent. This is safe, but it does mean that you’re not getting a full measurement of the dark launched service — what if it has a bug that causes 10% of the mutate requests to be incorrectly specified?

Alternatively, you might choose to send the mutation to a temporary duplicate of your existing storage. This is much better for the fidelity of your test, but great care will be needed to avoid sending real users the response from your temporary duplicate. It would also be very unfortunate for everyone if, at the end of your dark launch, you end up making the new service live when it’s still sending mutations to the temporary duplicate storage.

Storage migration

If you’re doing a storage migration — moving an existing system’s stored data from one storage system to another (for instance, MySQL to MongoDB because you’ve decided that you don’t really need SQL after all) — you’ll find that dark launches will be crucial in this migration, but you’ll have to be particularly careful about how you handle mutation-inducing queries. Eventually you’ll need mutations to take effect in both your old and new storage systems, and then you’ll need to make the new storage system the canonical storage for all user queries.

A good principle is that, during this migration, you should always make sure that you can revert to the old storage system if something goes wrong with the new one. You should know which of your systems (old and new) is the master for a given set of queries, and hence holds the canonical state. The mastership generally needs to be easily mutable and able to revert responsibility to the original storage system without losing data.

The universal requirement for a storage migration is a detailed written plan reviewed by not just your system stakeholders but also by your technical experts from the involved systems. Inevitably, your plan will miss things and will have to adapt as you move through the migration. Moving between storage systems can be an awfully big adventure — expect us to address this in a future blog post.

Duplicate traffic costs

The great thing about a well-implemented dark launch is that it exercises the full service in processing a query, for both the original and new service. The problem this brings is that each query costs twice as much to process. That means you should do the following:


  • Make sure your backends are appropriately provisioned for 2x the current traffic. If you have quota in other teams’ backends, make sure it’s temporarily increased to cover the dark launch as well.
  • If you’re connection-sensitive, ensure that your frontends have sufficient slack to accommodate a 2x connection count.
  • You should already be monitoring latency from your existing frontends, but keep a close eye on this monitoring stat and consider tightening your existing alerting thresholds. As service latency increases, service memory likely also increases, so you’ll want to be alert for either of these stats breaching established limits.


In some cases, the service traffic is so large that a 100% dark launch is not practical. In these instances, we suggest that you determine the largest percentage launch that is practical and plan accordingly, aiming to get the most representative selection of traffic in the dark launch. Within Google, we tend to launch a new service to Googlers first before making the service public. However, experience has taught us that Googlers are often not representative of the rest of the world in how they use a service.

An important consideration if your service makes substantial use of caching is that a sub-50% dark launch is unlikely to see material benefits from caching and hence will probably significantly overstate estimated load at 100%.

You may also choose to test-load your new service at over 100% of current traffic by duplicating some traffic — say, firing off two queries to the new service for every original query. This is fine, but you should scale your quota increases accordingly. If your service is cache-sensitive, then this approach will probably not be useful as your cache hit rate will be artificially high.

Because of the load impact of duplicate traffic, you should carefully consider how to use load shedding in this experiment. In particular, all dark launch traffic should be marked “sheddable” and hence be the first requests to be dropped by your system when under load.

In any case, if your service on-call sees an unexpected increase in CPU/memory/latency, they should drop the dark launch to 0% and see if that helps.

Summary

If you’re thinking about a dark launch for a new service, consider writing a dark launch plan. In that plan, make sure you answer the following questions:


  • Do you have existing traffic which you can fork and send to your new service?
  • Where will you fork the traffic: the application frontend, or somewhere else?
  • Will you fire off the message to the new backend asynchronously, or will you wait for it and impose a timeout?
  • What will you do with requests that generate mutations?
  • How and where will you log the responses from the original and new services, and how will you compare them?
    • Are you logging the following things: response code, backend latency, and response message size?
    • Will you be diffing responses? Are there fields that cannot meaningfully be diffed which you should skip in your comparison?
  • Have you made sure that your backends can handle 2x the current peak traffic, and have you given them temporary quota for it?
    • If not, at what percentage traffic will you stop the dark launch?
  • How are you going to select traffic for participation in the dark launch percentage: randomly, or by hashing on a key such as user ID?
  • Which teams need to know that this dark launch is happening? Do they know how to escalate concerns?
  • What’s your rollback plan after you make your new service live?


It may be that you don’t have enough surprises or excitement in your life; in that case, you don’t need to worry about dark launches. But if you feel that your service gives you enough adrenaline rushes already, dark launching is a great technique to make service launches really, really boring.

CRE life lessons: What is a dark launch, and what does it do for me?



Say you’re about to launch a new service. You want to make sure it’s ready for the traffic you expect, but you also don’t want to impact real users with any hiccups along the way. How can you find your problem areas before your service goes live? Consider a dark launch.

A dark launch sends a copy of real user-generated traffic to your new service, and discards the result from the new service before it's returned to the user. (Note: We’ve also seen dark launches referred to as “feature toggles,” but this doesn’t generally capture the “dark” or hidden traffic aspect of the launch.)

Dark launches allow you to do two things:

  1. Verify that your new service handles realistic user queries in the same way as the existing service, so you don’t introduce a regression.
  2. Measure how your service performs under realistic load.
Dark launches typically transition gradually from a small percentage of the original traffic to a full (100%) dark launch where all traffic is copied to the new backend, discovering and resolving correctness and scaling issues along the way. If you already have a source of traffic for your new site — for instance, when you’re migrating from an existing frontend to a new frontend — then you’re an excellent candidate for a dark launch.


Where to fork traffic: clients vs. servers

When considering a dark launch, one key question is where the traffic copying/forking should happen. Normally this is the application frontend, i.e. the first service, which (after load balancing) receives the HTTP request from your user and calculates the response. This is the ideal place to do the fork, since it has the lowest friction of change — specifically, in varying the percentage of external traffic sent to the new backend. Being able to quickly push a configuration change to your application frontend that drops the dark launch traffic fraction back down to 0% is an important — though not crucial — requirement of a dark launch process.

If you don’t want to alter the existing application frontend, you could replace it with a new proxy service which does the traffic forking to both your original and a new version of the application frontend and handles the response diffing. However, this increases the dark launch’s complexity, since you’ll have to juggle load balancing configurations to insert the proxy before the dark launch and remove it afterwards. Your proxy almost certainly needs to have its own monitoring and alerting — all your user traffic will be going through it, and it’s completely new code. What if it breaks?

One alternative is to send traffic at the client level to two different URLs, one for the original service, and the other for the new service. This may be the only practical solution if you’re dark launching an entirely new app frontend and it’s not practical to forward traffic from the existing app frontend — for instance, if you’re planning to move a website from being served by an open-source binary to your own custom application. However, this approach comes with its own set of challenges.



The main risk in client changes is the lack of control over the client’s behavior. If you need to turn down the traffic to the new application, then you’ll at least need to push a configuration update to every affected mobile application. Most mobile applications don’t have a built-in framework for dynamically propagating configuration changes, so in this case you’ll need to make a new release of your mobile app. It also potentially doubles the traffic from mobile apps, which may increase user data consumption.

Another client change risk is that the destination change gets noticed, especially for mobile apps whose teardowns are a regular source of external publicity. Response diffing and logging results is also substantially easier within an application frontend than within a client.


How to measure a dark launch

It’s little use running a dark launch if you’re not actually measuring its effect. Once you’ve got your traffic forked, how do you tell if your new service is actually working? How will you measure its performance under load?

The easiest way is to monitor the load on the new service as the fraction of dark launch traffic ramps up. In effect, it’s a very realistic load test, using live traffic rather than canned traffic. Once you’re at 100% dark launch and have run over a typical load cycle — generally, at least one day — you can be reasonably confident that your server won’t actually fall over when the launch goes live.

If you’re planning a publicity push for your service, you should try to maximize the additional load you put on your service and adjust your launch estimate based on a conservative multiplier. For example, say that you can generate 3 dark launch queries for every live user query without affecting end-user latency. That lets you test how your dark-launched service handles three times the peak traffic. Do note, however, that increasing traffic flow through the system by this amount carries operational risks. There is a danger that your “dark” launch suddenly generates a lot of “light” — specifically, a flickering yellow-orange light which comes from the fire currently burning down your service. If you’re not already talking to your SREs, you need to open a channel to them right now to tell them what you’re planning.

Different services have different peak times. A service that serves worldwide traffic and directly faces users will often peak Monday through Thursday in the morning in the US as this is where users normally dominate traffic. By contrast, a service like a photo upload receiver is likely to peak on weekends when users take more photos, and will get huge spikes on major holidays like New Years. Your dark launch should try to cover the heaviest live traffic that it’s reasonable to wait for.

We believe that you should always measure service load during a dark launch as it is very representative data for your service and requires near-zero effort to do.

Load is not the only thing you should be looking at, however, as the following measurements should also be considered.


Logging needs

The point where incoming requests are forked to the original and new backends — generally, the application front end — is typically also the point where the responses come back. This is, therefore, a great place to record the responses for later analysis. The new backend results aren’t being returned to the user, so they’re not normally visible directly in monitoring at the application frontend. Instead, the application will want to log these responses internally.

Typically the application will want to log response code (e.g. 20x/40x/50x), latency of the query to the backend, and perhaps the response size, too. It should log this information for both the old and new backends so that the analysis can be a proper comparison. For instance, if the old backend is returning a 40x response for a given request, the new backend should be expected to return the same response, and the logs should enable developers to make this comparison easily and spot discrepancies.

We also strongly recommend that responses from original and new services are logged and compared throughout dark launches. This tells you whether your new service is behaving as you expect with real traffic. If your logging volume is very high, and you choose to use sampling to reduce the impact on performance and cost, make sure that you account in some way for the undetected errors in your traffic that were not included in the logs sample.

Timeouts as a protection

It’s quite possible that the new backend is slower than the original — for some or all traffic. (It may also be quicker, of course, but that’s less interesting.) This slowness can be problematic if the application or client is waiting for both original and new backends to return a response before returning to the client.

The usual approaches are either to make the new backend call asynchronous, or to enforce an appropriately short timeout for the new backend call after which the request is dropped and a timeout logged. The asynchronous approach is preferred, since the latter can negatively impact average and percentile latency for live traffic.

You must set an appropriate timeout for calls to your new service, and you should also make those calls asynchronous from the main user path, as this minimizes the effect of the dark launch on live traffic.


Diffing: What’s changed, and does it matter?

Dark launches where the responses from the old and new services can be explicitly diff’ed produce the most confidence in a new service. This is often not possible with mutations, because you can’t sensibly apply the same mutation twice in parallel; it’s a recipe for conflicts and confusion.

Diffing is nearly the perfect way to ensure that your new backend is drop-in compatible with the original. At Google, it’s generally done at the level of protocol buffer fields. There may be fields where it’s acceptable to tolerate differences, e.g. ordering changes in lists. There’s a trade-off between the additional development work required for a precise meaningful comparison and the reduced launch risk this comparison brings. Alternatively, if you expect a small number of responses to differ, you might give your new service a “diff error budget” within which it must fit before being ready to launch for real.

You should explicitly diff original and new results, particularly those with complex contents, as this can give you confidence that the new service is a drop-in replacement for the old one. In the case of complex responses, we strongly recommend either setting a diff “error budget” (accept up to 1% of responses differing, for instance) or excluding low-information, hard-to-diff fields from comparison.

This is all well and good, but what’s the best way to do this diffing? While you can do the diffing inline in your service, export some stats, and log diffs, this isn't always the best option. It may be better to offload diffing and reporting out of the service that issues the dark launch requests.

Within Google, we have a number of diffing services. Some run batch comparisons, some process data in live streams, others provide a UI for viewing diffs in live traffic. For your own service, work out what you need from your diffing and implement something appropriate.

Going live

In theory, once you’ve dark-launched 100% of your traffic to the new service, making it go “live” is almost trivial. At the point where the traffic is forked to the original and new service, you’ll return the new service response instead of the original service response. If you have an enforced timeout on the new service, you’ll change that to be a timeout on the old service. Job done! Now you can disable monitoring of your original service, turn it off, reclaim its compute resources, and delete it from your source code repository. (A team meal celebrating the turn-down is optional, but strongly recommended.) Every service running in production is a tax on support and reliability, and reducing the service count by turning off a service is at least as important as adding a new service.

Unfortunately, life is seldom that simple. (As id Software’s John Cash once noted, “I want to move to ‘theory,’ everything works there.”) At the very least, you’ll need to keep your old service running and receiving traffic for several weeks in case you run across a bug in the new service. If things start to break in your new service, your reflexive action should be to make the original service the definitive request handler because you know it works. Then you can debug the problem with your new service under less time pressure.

The process of switching services may also be more complex than we’ve suggested above. In our next blog post, we’ll dig into some of the plumbing issues that increase the transition complexity and risk.

Summary

Hopefully you’ll agree that dark launching is a valuable tool to have when launching a new service on existing traffic, and that managing it doesn’t have to be hard. In the second part of this series, we’ll look at some of the cases that make dark launching a little more difficult to arrange, and teach you how to work around them.

Making the most of an SRE service takeover – CRE life lessons



In Part 2 of this blog post we explained what an SRE team would want to learn about a service angling for SRE support, and what kind of improvements they want to see in the service before considering it for take-over. And in Part 1, we looked at why an SRE team would or wouldn’t choose to onboard a new application. Now, let’s look at what happens once the SREs agree to take on the pager.

Onboarding preparation

If a service entrance review determines that the service is suitable for SRE support, developers and the SRE team move into the “onboarding” phase, where they prepare for SREs to support the service.

While developers address the action items, the SRE team starts to familiarize itself with the service, building up service knowledge and familiarity with the existing monitoring tools, alerts and crisis procedures. This can be accomplished through several methods:
  • Education: present the new service to the rest of the team through tech talks, discussion sessions and "wheel of misfortune" scenarios.
  • “Take the pager for a spin”: share pager alerts with the developers for a week, and assess each page on the axes of criticality (does this indicate a user-impacting problem with the service?) and actionability (is there a clear path for the on-call to to resolve the underlying issue?). This gives the SRE team a quantitative measure of how much operational load the service is likely to impose.
  • On-call shadow: page the primary on-call developer and SRE at the same time. At this stage, responsibility for dealing with emergencies rests on the developer, but the developer and the SRE collaborate on debugging and resolving production issues together.

Measuring success


Q: I’ve gone through a lot of effort to make my service ready to hand over to SRE. How can I tell whether it was a good expenditure of scarce engineering time?

If the developer and SRE teams have agreed to hand over a system, they should also agree on criteria (including a timeframe) to measure whether the handover was successful. Such criteria may include (with appropriate numbers):
  • Absolute decrease of paging/outages count
  • Decreasing paging/outages as a proportion of (increasing) service scale and complexity.
  • Reduced time/toil from the point of new code passing tests to being deployed globally, and a flat (or decreasing) rollback rate.
  • Increased utilization of reserved resources (CPU, memory, disk etc.)
Setting these criteria can then prepare the ground for future handover proposals; if the success criteria for a previous handover were not met, the teams should carefully reconsider how this will change the handover plans for a new service.

Taking over the pager


Once all the blocking action items have been resolved, it’s time for SREs to take over the service pager. This should be a "no drama" event, with few, well-documented service alerts, that can be easily resolved by following procedures in the service playbook.

In theory, the SRE team will have identified most of these issues in the entrance review phase, but realistically there any many issues that are only apparent with sustained exposure to a service.

In the medium term (one to two months), SREs should build a list of deficiencies or areas for optimization in the system with regard to monitoring, resource consumption etc. This hitlist should primarily aim to reduce SRE “toil” (manual, repetitive, tactical work that has no enduring value), and secondarily fix aspects of the system, e.g., resource consumption or cruft accumulation, which can impact system performance. Tertiary changes may include things like updating the documentation to facilitate onboarding new SREs for system support.

In the long term (three to six months), SREs should expect to meet most or all of the pre-established measurements for takeover success as described above.

Q: That’s great, so now my developers can turn off their pager?

Not so fast, my friend. Although the SRE team has learned a lot about the service in the preceding months, they're still not experts; there will inevitably be failure modes involving arcane service behavior where the SRE on-call will not know what has broken, or how to fix it. There's no substitute for having a developer available, and we normally require developers to keep their on-call rotation so that the SRE on-call can page them if needed. We expect this to be a low rate of pages.

The nuclear option — handing back the pager


Not all SRE takeovers go smoothly, and even if the SREs have taken over the pager for a service, it’s possible for reliability to regress or operational load to increase. This might be for good reasons such as a “success disaster”  a sustained and unexpected spike in usage  or for bad reasons such as poor QA testing.

An SRE team can only handle so many services, and if one service starts to consume a disproportionate amount of SRE time, it's at risk of crowding out other services. In this case, the SRE team should proactively tell the developer team that they have a problem, and should do so in a neutral way that’s data-heavy:

In the past month we’ve seen 100 pages/week for service S1, compared to a steady rate of 20-30 pages/week over the past few weeks. Even though S1 is within SLO, the pages are dominating our operational work and crowding out service improvement work. You need to do one of the following:
  1. bring S1’s paging rate down to the original rate by reducing S1’s rate of change
  2. de-tune S1’s alerts so that most of them no longer page
  3. tell us to drop SRE support for services S2, S3 so our overall paging rate remains steady
  4. tell us to drop SRE support for S1
This lets the developer team decide what’s most important to them, rather than the SRE team imposing a solution.

There are also times when developers and SREs agree that handing back the pager to developers is the right thing to do, even if the operational load is normal. For example, imagine SREs are supporting a service, and developers come up with a new, shiny, higher-performing version. Developers support the new version initially, while working out its kinks, and migrate more and more users to it. Eventually the new version is the most heavily used  this is when SREs should take on the pager for the new service and hand the old service’s pager back to developers. Developers can then finish user migrations and turn down the old service at their convenience.

Converging your SRE and dev teams


Onboarding a service is about more than transferring responsibility from developers to SREs  it also improves mutual understanding between the two teams. The dev team gets to know what the SRE team does, and why, who the individual SREs are, and perhaps how they got that way. Similarly the SRE team gains a better understanding of the development team’s work and concerns. This increase in empathy is a Good Thing in itself, but is also an opportunity to improve future applications.

Now, when a developer team designs a new application or service, they should take the opportunity to invite the SRE team to the discussion. SRE teams can easily spot reliability issues in the design, and advise developers on ways to make the service easier to operate, set up good monitoring and configure sensible rollout policies from the start.

Similarly, when the SREs do future planning or design new tooling, they should include developers in the discussions; developers can advise them on future launches and projects, and give feedback on making the tools easier to operate or a better fit for developers’ needs.

Imagine that there was a brick wall between the SRE and developer teams; our original plan for service takeover was to throw the service over the wall and hope. Over the course of these blog posts, we’ve shown you how to make a hole in the wall so there can be two-way communication as the service is passed through, then expand it into a doorway so that SREs can come into the developers’ backyard and vice versa. Eventually, developers and SREs should tear down the wall entirely, and replace it with a low hedge and ornamental garden arch. SREs and developers should be able to see what’s going on in each others’ yard, and wander over to the other side as needed.


Summary


When an SRE takes on pager responsibility for developer-supported service, don’t just throw it over the fence into their yard. Work with the SRE team to help them understand how the service works and how it breaks, and to find ways to make it more resilient and easier to support. Make sure that supporting your service is a good use of the SRE team’s time, making use of their particular skills. With a carefully-planned handover process, you can both be confident that the queries will flow and your pagers will be (mostly) silent.

How SREs find the landmines in a service – CRE life lessons



In Part 1 of this blog post we looked at why an SRE team would or wouldn’t choose to onboard a new application. In this installment, we assume that the service would benefit substantially from SRE support, and look at what needs to be done for SREs to onboard it with confidence.

Onboarding review


Q: We have a new application that would make sense for SRE to support. Do I just throw it over the wall and tell the SRE team “Here you are; you’re on call for this now, best of luck”?

That’s a great approach  if your goal is failure. At first, your developer team’s assessment of the application’s importance for their support  and whether it’s in a supportable state  is likely to be rather different from your SRE team’s assessment, and arbitrarily imposing support for a service onto an SRE team is unlikely to work. Think about it  you haven’t convinced them that the service is a good use of their time yet, and human nature is that people don’t enthusiastically embrace doing something that they don’t really believe in, so they're unlikely to be active participants in making the service materially more reliable.

At Google, we’ve found that to successfully onboard a service into SRE, the service owner and SRE team must agree to a process for the SRE team to understand and assess the service, and identify critical issues to be resolved upfront (Incidentally, we follow a similar process when deciding whether or not to onboard a Google Cloud customer’s application into our Customer Reliability Engineering program). We typically split this into two phases:

  • SRE entrance review: where an SRE team assess whether a developer-supported service should be onboarded by SRE, and what the onboarding preconditions should be.
  • SRE onboarding/takeover: where a dev and SRE team agree in principle that the SRE team should take on primary operational responsibility for a service, and start negotiating the exact conditions for takeover (how and when the SREs will onboard the service).

It’s important to remember the motivations of the various parties in this process:

  • Developers want someone else to pick up support for the service, and make it run as well as possible. They want users to feel that the service is working properly, otherwise they'll move to a service run by someone else.
  • The SRE team wants to be sure that they're not being “sold a pup” with a hard-to-support service, and have a vision for making the production service lower in toil and more robust.
  • Meanwhile the company management wants to reduce the number of embarrassing service outages, as long as it doesn’t cost them too much in engineer time.

The SRE entrance review

During an SRE entrance review (SER), also referred to as a Production Readiness Review (PRR), the SRE team takes the measure of a service currently running in production. The purpose of an SER is to:

  1. Assess how the service would benefit from SRE ownership
  2. Identify service design, implementation and operational deficiencies that could be a barrier to SRE takeover
  3. And if SRE ownership is determined to be beneficial, identify the bug fixes, process changes and necessary service behavior needed before onboarding the service

An SRE team typically designates a single person or a small subset of the team to familiarize themselves with the service, and evaluate it for fitness for takeover.

The SRE looks at the service as-is: its performance, monitoring, associated operational processes and recent outage history, and asks themselves: “If I were on-call for this service right now, what are the problems I’d want to fix?” They might be visible problems, such as too many pages happening per day, or potential problems such as a dependency on a single machine that will inevitably fail some day.

A critical part of any SRE analysis is the service’s Service Level Objectives (SLOs), and associated Service Level Indicators (SLIs). SREs assume that if a service is meeting its SLOs then paging alerts should be rare or non-existent; conversely, if the service is in danger of falling out of SLO then paging alerts are loud and actionable. If these expectations don’t match reality, the SRE team will focus on changing either the SLO definitions or the SLO measurements.

In the review phase, SREs aim to understand:

  • what the service does
  • day-to-day service operation (traffic variation, releases, experiment management, config pushes)
  • how the service tends to break and how this manifests in alerts
  • rough edges in monitoring and alerting
  • where the service configuration diverges from the SRE team’s practices
  • major operational risks for the service


The SRE team also considers:

  • whether the service follows SRE team best practices, and if not, how to retrofit it
  • how to integrate the service with the SRE team’s existing tools and processes
  • the desired engagement model and separation of responsibilities between the SRE team and the SWE team. When debugging a critical production problem, at what point should the SRE on-call page the developer on-call?


The SRE takeover


The SRE entrance review typically produces a prioritized list of issues with the service that need to be fixed. Most will be assigned to the development team, but the SRE team may be better suited for others. In addition, not all issues are blockers to SRE takeover (there might be design or architectural changes that SREs recommend for service robustness that could take many months to implement).

There are four main axes of improvement for a service in an onboarding process: extant bugs, reliability, automation and monitoring/alerting. On each axis there will be issues which will have to be solved before takeover (“blockers”), and others which would be beneficial to solve but not critical.

Extant bugs
The primary source of issues blocking SRE takeover tends to be action items from the service’s previous postmortems. The SRE team expects to read recent postmortems and verify that a) the proposed actions to resolve the outage root causes are what they’d expect and b) those actions are actually complete. Further, the absence of recent postmortems is a red flag for many SRE teams.
Reliability
Some reliability-related change requests might not directly block SRE takeover, as many reliability improvements relate to design, significant code changes, a change in back-end integrations or migration off a deprecated infrastructure component, and are targeting the longer-term evolution of the system towards a desired reliability increase.

The reliability-related changes that block takeover would be those which mitigate or remove issues which are known to cause significant downtime, or mitigate risks which are expected to cause an outage in the future.

Automation
This is a key concern for SREs considering take over of a service: how much manual work needs to be done to “operate” the service on a week-to-week basis, including configuration pushes, binary releases and similar time-sinks.

In order to find out what would be most useful to automate, the best way is for the SRE to get practical experience of the developer’s world. This means that the SREs should shadow the developer team’s typical week and get a feel for what routine manual work is actually involved for their on-call.

If there’s excessive manual work involved in supporting a service, automation usually solves the problem.

Monitoring/alerting
The dominant concern with most services undergoing SRE takeover is the paging rate  how many times the service wakes up the on-call staff. At Google, we adhere to the ”Treynor Maximum” of an average of two incidents per 12 hour shift (for an on-call team as a whole). Thus, an SRE team looks at the average incident load of a new service over the past month or so to see how it fits with their current incident load.

Generally, excessive paging rates are the result of one of three things:

  1. Paging on something that’s not intrinsically important e.g., task restart or hitting 80% capacity of disk. Instead, downgrade the page to a bug (if it’s not urgent) or eliminate it entirely. Moving to symptom-based monitoring (“users are actually seeing problems”) can help improve this situation.
  2. Page storms where one small incident/outage generates many pages. Try to group related pages for an incident into a single outage, to get a clearer picture of the system’s outage metrics.
  3. A system that’s having too many genuine problems. In this case SRE takeover in the near future is unlikely, but SREs may be able to help diagnose and resolve the root causes of the problems.
SREs generally want to see several weeks of low paging levels before agreeing to take over a service.

More general ways to improve the service might include:

  • integrating the service with standard SRE tools and practices e.g., load shedding, release processes and configuration pushes
  • extending and improving playbook entries to rely less on the developer team’s tribal knowledge
  • aligning the service’s configurations with the SRE team’s common languages and infrastructure
Ultimately, an SRE entrance review should produce guidance that's useful to developers even if the SRE team declines to onboard the service. In that event, the guidance from the review should still help developers make their service easier to operate and more reliable.

Smoothing the path


SREs need to understand the developers’ service, but SREs and developers also need to understand each other. If the developer team has not worked with SREs before, it can be useful for SREs to give “lightning” talks to the developers on SRE topics such as monitoring, canarying, rollouts and data integrity. This gives the developers a better idea of why the SREs are asking particular questions and pushing particular concerns.

One of Google’s SREs found that it was useful to “pretend that I am a dev team novice, and have the developer take me through the codebase, explain the history, show me where the main() function is, and so on.”

Similarly, SREs should understand the developers’ point of view and experience. During the SER, at least one SRE should sit with the developers, attend their weekly meetings and stand-ups, informally shadow their on-call and help out with day-to-day work to get a “big picture” view of the service and how it runs. It also helps remove distance between the two teams. Our experience has been that this is so positive in improving the developer-SRE relationship that the practice tends to continue even after the SER has finished.

Last but not least, the SRE entrance review document should also state clearly whether the service merits SRE takeover, and if so, why (or why not).

At this point, the developer team and SRE team both understand what needs to be done to make a service suitable for SRE takeover, if it is indeed feasible at all. In Part 3 of this blog post, we’ll look at how to proceed with a service takeover, and so both teams can benefit from the process.

Know thy enemy: how to prioritize and communicate risks – CRE life lessons



Editor’s note: We’ve spent a lot of time in CRE Life Lessons talking about how to identify and mitigate risks in your system. In this post, we’re going to talk about how to effectively communicate and stack-rank those risks.

When a Google Cloud customer engages with Customer Reliability Engineering (CRE), one of the first things we do is an Application Reliability Review (ARR). First, we try to understand your application’s goals: what it provides to users and the associated service level objectives (SLOs) (or we help you create SLOs if you do not have any!). Second, we evaluate your application and operations to identify risks that threaten your ability to reach your SLOs. For each identified risk, we provide a recommendation on how to eliminate or mitigate it based on our experiences at Google.

The number of risks identified for each application varies greatly depending on the maturity of your application and team and target level for reliability or performance. But whether we identify five risks or 50, two fundamental facts remain true: Some risks are worse than others, and you have a finite amount of engineering time to address them. You need a process to communicate the relative importance of the risks and to provide guidance on which risks should be addressed first. This appears easy, but beware! The human brain is notoriously unreliable at comparing and evaluating risks.

This post explains how we developed a method for analyzing risks during an ARR, allowing us to present our customers with a clear, ranked list of recommendations, explain why one risk is ranked above another, and describe the impact a risk may have on the application’s SLO target. By the end of this post, you’ll understand how to apply this to your own application, even without going through a CRE engagement.

Take one: the risk matrix

Each risk has many properties that can be used to evaluate its relative importance. In discussions internally and with customers, two properties in particular stand out as most relevant:
  • The likelihood of the risk occurring in a given time period.
  • The impact that would be felt if the risk materializes.
We began by defining three levels for each property, which are represented in the following 3x3 table.

Example table with representative risks for each category: The row headers represent likelihood and column headers represent impact.

Catastrophic
Damaging
Minimal
Frequent
Overload results in slow or dropped requests during the peak hour each day.
The wrong server is turned off and requests are dropped.
Restarts for weekly upgrades drop in-progress requests (i.e., no lame ducking).
Common
A bad release takes the entire service down. Rollback is not tested.
Users report an outage before monitoring and alerting notifies the operator.
A daylight savings bug drops requests.
Rare
There is a physical failure in the hosting location that requires complete restoration from a backup or disaster recovery plan.
Overload results in a cascading failure. Manual intervention is required to halt or fix the issue.
A leap year bug causes all servers to restart and drop requests.
We tested this approach with a couple of customers by bucketing the risks we had identified into the table. This is not a novel approach. We very quickly realized that our terminology and format are the same as that used in a risk matrix, a commonly used management tool in the risk assessment field. This realization seemed to confirm that we were on the right track, and had created something that customers and their management could easily understand.

We were right: Our customers told us that the table of risks was a good overview and was easy to grasp. However, we struggled to explain the relative importance of entries in the list based on the cells in the table:
  • The distribution of risks across the cells was extremely uneven. Most risks ended up in the “common, damaging” cell, which doesn’t help to explain relative importance of the items within each cell.
  • Assigning a risk to a cell (and its subsequent position in the list of risks) is subjective and depends on the reliability target of the application. For example, the “frequent, catastrophic” example of dropping traffic for a few minutes during a release is catastrophic at four nines, but less so at two nines.
  • Ordering the cells into a ranking is not straightforward. Is it more important to handle a “rare, catastrophic” risk, or a “frequent, minimal” risk? The answer is not clear from the names or definitions of the categories alone. Further, the desired order can change from matrix to matrix depending on the number of items in each cell.

Risk expressed as expected losses

As we showed in the previous section, the traditional risk matrix does a poor job of explaining the relative importance of each risk. However, the risk assessment field offers another useful model: using impact and likelihood to calculate the expected loss from a risk. Expressed as a numeric quantity, this expected loss value is great way to explain the relative importance of our list of risks.

How do we convert qualitative concepts of impact and likelihood to quantified values that we can use to calculate expected loss? Consider our earlier posts on availability and SLOs, specifically, the concepts of Mean Time Between Failure (MTBF), Mean Time To Recover (MTTR), and error budget. The MTBF of a risk provides a measure of likelihood (i.e., how long it takes for the risk to cause a failure), the MTTR provides a measure of impact (i.e., how long we expect the failure to last before recovering), and the error budget is the expected number of downtime minutes per year that you're willing to allow (a.k.a. accepted loss).

Now with this system, when we work through an ARR and catalog risks, we use our experience and judgement to estimate each risk’s MTBF (counted in days) and the subsequent MTTR (counted in minutes out of SLO). Using these two values, we estimate the expected loss in minutes for each risk over a fixed period of time, and generate the desired ranking.

We found that calculating expected losses over a year is a useful timeframe for risk-ranking, and developed a three-colour traffic light system to provide high-level guidance and quick visual feedback on the magnitude of each risk vs. the error budget:
  • Red: This risk is unacceptable, as it falls above the acceptable error budget for a single risk (we typically use 25%), and therefore, can have a major impact on your reliability in a single event.
  • Amber: This risk should not be acceptable, as it’s a major consumer of your error budget and therefore, needs to be addressed. You may be able to accept some amber risks by addressing some less urgent (green) risks to buy back budget.
  • Green: This is an acceptable risk. It's not a major consumer of your error budget, and in aggregate, does not cause your application to exceed the error budget. You don't have to address green risks, but may wish to do so to give yourself more budget to cover unexpected risks, or to accept amber risks that are hard to mitigate or eliminate.
Based on the three-colour traffic light system, the following table demonstrates how we rank and colour the risks given a 3-nines availability target. The risks are a combination of those in the original matrix and some additional examples to help illustrate the amber category. You can refer to the spreadsheet linked at the end of this post to see the precise MTTR and MTBF numbers that underlie this table, along with additional examples of amber risks.
Risk
Bad minutes/year
Overload results in slow or dropped requests during the peak hour each day.
3559
A bad release takes the entire service down. Rollback is not tested.
507
Users report an outage before monitoring and alerting notifies the operator.
395
There is a physical failure in the hosting location that requires complete restoration from a backup or disaster recovery plan.
242
The wrong server is turned off and requests are dropped.
213
Overload results in a cascading failure. Manual intervention is required to halt or fix the issue.
150
Operator accidentally deletes database; restore from backup is required
129
Unnoticed growth in usage triggers overload; service collapses.
125
A configuration mishap reduces capacity; causing overload and dropped requests
122
A new release breaks a small set of requests; not detected for a day.
119
Operator is slow to debug and root cause bug due to noisy alerting
76
A daylight savings bug drops requests.
71
Restarts for weekly upgrades drop in-progress requests (i.e., no lame ducking).
52
A leap year bug causes all servers to restart and drop requests.
16

Other Considerations

The ranked list of risks is extremely useful for communicating the findings of an ARR and conveying the relative magnitude of the risks compared to each other. We recommend that you use the list only for this purpose. Do not prioritize your engineering work directly based on the list. Instead, use the expected loss values as inputs to your overall business planning process, taking into consideration remediation and opportunity costs to prioritize work.

Also, don’t be tricked into thinking that because you have concrete numbers for the expected loss, that they are precise! They’re only as good as the estimates derived from MTBF and MTTR values. In the best case, MTBF and MTTR are averages from observed data; more commonly, they will be estimates based purely on intuition and experience. To minimize introducing errors into the final ranking, we recommend estimating MTBF and MTTR values likely to be within an order of magnitude of correct, rather than use specific, potentially inaccurate values.

Somewhat in contrast to the advice just mentioned, we find it useful to introduce additional granularity into the calculation of MTBF and MTTR values, for more accurate estimates. First, we split MTTR into two components:
  • Mean Time To Detect (MTTD): The time between when the risk first manifests and when the issue is brought to the attention of someone (or something) capable of remediating it.
  • Mean Time To Repair (MTTR): Redefined to mean the time between when the issue is brought to the attention of someone capable of remediating it and when it is actually remediated.
This granularity is driven by the realization that, often, the time to notice an issue and the time to fix it differ significantly. It’s easier to assess and ensure estimates are consistent across risks with these figures separately specified.

Second, in addition to considering MTTD, we also factor in what proportion of the users are affected by a risk (e.g., in a sharded system, shards can fail at a given rate and incur downtime before a successful failover succeeds, but each failure only impacts a proportion of the users). Taking these two optimizations into account, our overall formula for calculating the expected annual loss from a risk is:

(MTTD + MTTR) * (365.25 / MTBF) * percent of affected users

To implement this method for your own application, here is a spreadsheet template that you can copy and populate with your own data: https://goo.gl/bnsPj7

Summary

When analyzing the reliability of an application, it is easy to generate a large list of potential risks that must be prioritized for remediation. We have demonstrated how the MTBF and MTTR values of each risk can be used to develop a prioritized list of risks based on the expected impact on the annual error budget.

We here in CRE have found this method to be extremely helpful. In addition, customers can use the expected loss figure as an input to more comprehensive risk assessments, or cost/benefit calculations of future engineering work. We hope you find it helpful too!

How release canaries can save your bacon – CRE life lessons



The first part of any reliable software release is being able to roll back if something goes wrong; we discussed how we do this at Google in last week’s post, Reliable releases and rollbacks. Once you have that under your belt, you’ll want to understand how to detect that things are starting to go wrong in the first place, with canarying.
Photo taken by David Carroll
The concept of canarying first emerged in 1913 when physiologist John Scott Haldane took the caged bird down into a coal mine, to detect for carbon monoxide. This fragile bird is more susceptible to the odorless gas than humans, and quickly falls off its perch in its presence  signaling to the miners that it’s time to get out!

In software, a canary process is usually the first instance that receives live production traffic about a new configuration update, either a binary or configuration rollout. The new release only goes to the canary at first. The fact that the canary handles real user traffic is key: if it breaks, real users get affected, so canarying should be the first step in your deployment process, as opposed to the last step in testing.

The first step in implementing canarying is a manual process where release engineers trigger the new binary release to the canary instance(s). They then monitor the canary for any signs of increased errors, latency and load. If everything looks good, they then trigger a release to the rest of the production instances.

We here on Google’s SRE teams have found over time that manual inspection of monitoring graphs isn’t sufficiently reliable to detect performance problems or rises in error rates of a new release. When most releases work well, the release engineer gets used to seeing no problems and so, when a low-level problem appears, tends to implicitly rationalize the monitoring anomalies as “noise.” We have several internal postmortems on bad releases whose root cause boils down to “the canary graph wasn’t wiggly enough to make the release engineer concerned.”

We've moved towards automated analysis, where our canary rollout service measures the canary tasks to detect elevated errors, latency and load automatically  and roll back automatically. (Of course, this only works if rollbacks are safe!)

Likewise, if you implement canaries as part of your releases, take care to make it easy to see problems with a release. Consider very carefully how you implement fault tolerance in your canary tasks; it’s fine for the canary to do the best it can with a query, but if it starts to see errors either internally or from its dependency services then it should “squawk loudly” by manifesting those problems in your monitoring. (There’s a good reason why the Welsh miners didn’t breed canaries to be resistant to toxic gases, or put little gas masks on them.)

Client canarying

If you’re doing releases of client software, you should have a mechanism for canarying new versions of the client, and you'll need to answer the following questions:
  1. How will you deploy the new version to only a small percentage of users?
  2. How will you detect if the new version is crash-looping, dropping traffic or showing users errors? (“What's the monitoring sound of no queries happening?”)
A solution for question 2 is for clients to identify themselves to your backend service ideally, by including information in each request about the client’s operating system and application version ID  and for the server to log this information. If you can make the clients identify themselves specifically as canaries, so much the better; this lets you export their stats to a different set of monitoring metrics. To detect that clients are failing to send queries, you'll generally need to know what the lowest plausible amount of incoming traffic is at any given time of the day or week, and trigger an alert if inbound traffic drops below that amount.

Typically, alerting rules for canaries for high-availability systems use a longer evaluation duration (how long you listen to the monitoring signals before deciding you have a problem) than for the main system because the much smaller traffic amount makes the standard signal much noisier; a relatively innocuous problem such as a few service instances being restarted can briefly push the canary error rate above the regular alarm threshold.

Your release should normally aim to cover a wide range of user types but a small fraction of active users. For Android clients, the Google Play Store allows you to deploy a new version of your application package file (APK) to an (essentially random) fraction of users; you can do this on a country-by-country basis. However, see the discussion on Android APK releases below for the limitations and risks in this approach.

Web clients

If your end users access your service via desktop or mobile web rather than an application, you tend to have better control of what’s being executed.

Regular web clients whose UI is managed by JavaScript are fairly easy to control in that you have the potential to deliver updated JavaScript resources to them every time a page loads. However, if you cache JavaScript and similar resources client-side  which is useful in reducing service load and user latency+bandwidth consumption  it’s hard to roll back a bad change. As we discussed in our last post, anything that gets in the way of easy and quick rollbacks is going to be a problem.

One solution is to version your JavaScript files (first release in a /v1/ directory, second in a /v2/ etc.). Then the rollout simply consists of changing the resource links in your root pages to reference the new (or old) versions.

Android APK releases

New versions of an Android app can be rolled out to a % of current users using staged rollouts in the Play Store. This lets you try out a new release of an app on a small subset of your current users; once you have confidence in that release, you can roll it out to more users, and so on.

The % release mechanism marks a percent of users that are eligible to pick up the new release. When their mobile device next checks into the Play Store for updates, it will see an available update for the app and start the update process.

There can be problems with this approach though:
  • You have no control over when eligible-for-update users will actually check in; normally it’ll be within 24 hours, assuming they have adequate connectivity, but this may not be true for users in countries where cellular and Wi-Fi data services are slow and expensive per-byte.
  • You have no control over whether users will accept your update on their mobile device, which can be a particular issue if the new release requires additional permissions.
Following the canarying process described above, you can determine whether your new client release has a problem once your active user base of the canary grows enough for the characteristics of the new traffic become clear: Is there a higher error rate? Is the latency rising? Has traffic to your server mysteriously increased sharply?

If you have a known bad release of your app at version v, the most expedient fix (given the inability to roll back) might be to build your version v-1 code branch into release v+1 and release that, stepping up quickly to 100%. That removes the time pressure to fix the problems detected in code.

Release percentage steps

When you perform a gradual release of a new binary or app, you need to decide in what percentage increments to release your application, and when to trigger the next step in a release. Consider:
  1. The first (canary) step should generate enough traffic for any problems to be clear in your monitoring or logging; normally somewhere between 1% and 10% depending on the size of your user base.
  2. Each step involves significant manual work and delays the overall release. If you step by 3% per day, it will take you a month to do a complete release.
  3. Going up by a single large increment (say, 10% to 100%) can reveal dramatic traffic problems that weren’t apparent at much smaller traffic levels: try not to increase your upgraded user base by more than 2x per step if this is a risk.
  4. If a new version is good, you generally want most of your users to pick it up quickly. If you're doing a rollback, you want to ramp up to 100% much faster than for a new release.
  5. Traffic patterns are often diurnal  typically, highest during the daytime  so you may need at least 24 hours to see the peak traffic load after a release.
  6. In the case of mobile apps, you'll also need to allow time for the users to pick up and start using the new release after they’ve been enabled for it.
If you're looking to roll out an Android app update to most of your users within a few days, you might choose to use a Play Store staged update starting with a 10% rollout that then increases to 50% and finally 100%. Plan for at least 24 hours between release stages and check your monitoring and logging before the next step. This way, a large fraction of your user base picks up the new release within 72 hours of the initial release, and it’s possible to detect most problems before they become too big to handle. For launches where you know there's a risk of significant traffic increase to a service, choose to use steps of 10%, 25%, 50% and 100%  or even more fine-grained increases.

For internal binary releases where you update your service instances directly, you might instead choose to use steps of 1%, 10% then 100%. The 1% release lets you see if there's any gross error in the new release, e.g., if 90% of responses are errors. The 10% release lets you pick up errors or latency increases that are one order of magnitude smaller, and detect any gross performance differences. The third step is normally a complete release. For performance-sensitive systems  generally, those operating at 75%+ of capacity  consider adding a 50% step to catch more subtle performance regressions. The higher the target reliability of a system, the longer you should let each step “bake” to detect problems.

If an ideal marketing launch sequence is 0-100 (everyone gets the new features at once), and the ideal reliability engineer launch sequence is 0-0 (no change means no problems), the “right” launch sequence for an app is inevitably a matter of negotiation. Hopefully the considerations described here give you a principled way to determine a mutually acceptable rollout. The graph below shows you how these various strategies might play out over an 8-day release window.

Summary

In short, we here at Google have developed a software release philosophy that works well for us, for a variety of scenarios:
  • “Rollback early, rollback often.” Try to move your service towards this philosophy, and you’ll reduce the Mean Time To Recover of your service.
  • “Canary your rollouts.” No matter how good your testing and QA, you'll find that your binary releases occasionally have problems with live traffic. An effective canarying strategy and good monitoring can reduce the Mean Time To Detect these problems, and dramatically reduce the number of affected users.
At the end of the day, though, perhaps the best kind of launch is one where the features launched can be enabled independent of the binary rollout. That’s a blog post for another day.

Reliable releases and rollbacks – CRE life lessons



Editor’s note: One of the most common causes of service outages is releasing a new version of the service binaries; no matter how good your testing and QA might be, some bugs only surface when the affected code is running in production. Over the years, Google Site Reliability Engineering has seen many outages caused by releases, and now assumes that every new release may contain one or more bugs.

As software engineers, we all like to add new features to our services; but every release comes with the risk of something breaking. Even assuming that we are appropriately diligent in adding unit and functional tests to cover our changes, and undertaking load testing to determine if there are any material effects on system performance, live traffic has a way of surprising us. These are rarely pleasant surprises.

The release of a new binary is a common source of outages. From the point of view of the engineers responsible for the system’s reliability, that translates to three basic tasks:
  1. Detecting when a new release is actually broken;
  2. Moving users safely from a bad release to a “hopefully” fixed release; and
  3. Preventing too many clients from suffering through a bad release in the first place (“canarying”).
For the purpose of this analysis, we’ll assume that you are running many instances of your service on machines or VMs behind a load balancer such as nginx, and that upgrading your service to use a new binary will involve stopping and starting each service instance.

We’ll also assume that you monitor your system with something like Stackdriver, measuring internal traffic and error rates. If you don’t have this kind of monitoring in place, then it’s difficult to meaningfully discuss reliability; per the Hierarchy of Reliability described in the SRE Book, monitoring is the most fundamental requirement for a reliable system).

Detection

The best case for a bad release is that when a service instance is restarted with the bad release, a major fraction of improperly handled requests generate errors such as HTTP 502, or much higher response latencies than normal. In this case, your overall service error rate rises quickly as the rollout progresses through your service instances, and you realize that your release has a problem.

A more subtle case is when the new binary returns errors on a relatively small fraction of queries - say, a user setting change request, or only for users whose name contains an apostrophe for good or bad reasons. With this failure mode, the problem may only become manifest in your overall monitoring once the majority of your service instances are upgraded. For this reason, it can be useful to have error and latency summaries for your service instance broken down by binary release version.

Rollbacks

Before you plan to roll out a new binary or image to your service, you should ask yourself, “What will I do if I discover a catastrophic / debilitating / annoying bug in this release?” Not because it might happen, but because sooner or later it is going to happen and it is better to have a well-thought out plan in place instead of trying to make one up when your service is on fire.

The temptation for many bugs, particularly if they are not show-stoppers, is to build a quick patch and then “roll forward,” i.e., make a new release that consists of the original release plus the minimal code change necessary to fix the bug (a “cherry-pick” of the fix). We don’t generally recommend this though, especially if the bug in question is user-visible or causing significant problems internally (e.g., doubling the resource cost of queries).

What’s wrong with rolling forward? Put yourself in the shoes of the software developer: your manager is bouncing up and down next to your desk, blood pressure visibly climbing, demanding to know when your fix is going to be released because she has your company’s product director bending her ear about all the negative user feedback he’s getting. You’re coding the fix as fast as humanly possible, because for every minute it’s down another thousand users will see errors in the service. Under this kind of pressure, coding, testing or deployment mistakes are almost inevitable.

We have seen this at Google any number of times, where a hastily deployed roll-forward fix either fails to fix the original problem, or indeed makes things worse. Even if it fixes the problem it may then uncover other latent bugs in the system; you’re taking yourself further from a known-good state, into the wilds of a release that hasn’t been subject to the regular strenuous QA testing.

At Google, our philosophy is that “rollbacks are normal.” When an error is found or reasonably suspected in a new release, the releasing team rolls back first and investigates the problem second. A request for a rollback is not interpreted as an attack on the releasing team, or even the person who wrote the code containing the bug; rather, it is understood as The Right Thing To Do to make the system as reliable as possible for the user. No-one will ask “why did you roll back this change?” as long as the rollback changelist describes the problem that was seen.

Thus, for rollbacks to work, the implicit assumption is that they are:

  1. easy to perform; and
  2. trusted to be low-risk.

How do we make the latter true?

Testing rollbacks

If you haven’t rolled back in a few weeks, you should do a rollback “just because”; aim to find any traps with incompatible versions, broken automation/testing etc. If the rollback works, just roll forward again once you’ve checked out all your logs and monitoring. If it breaks, roll forward to remove the breakage and then focus all your efforts on diagnosing the cause of the rollback breakage. It is better by far to detect this when your new release is working well, rather than being forced off a release that is on fire and having to fight to get back to your known-good original release.

Incompatible changes

Inevitably, there are going to be times when a rollback is not straightforward. One example is when the new release requires a schema change to an in-app database (such as a new column). The danger is that you release the new binary, upgrade the database schema, and then find a problem with the binary that necessitates rollback. This leaves you with a binary that doesn’t expect the new schema, and hasn’t been tested with it.

The approach we recommend here is a feature-free release; starting from version v of your binary, build a new version v+1 which is identical to v except that it can safely handle the new database schema. The new features that make use of the new schema are in version v+2. Your rollout plan is now:
  1. Release binary v+1
  2. Upgrade database schema
  3. Release binary v+2
Now, if there are any problems with either of the new binaries then you can roll back to a previous version without having to also roll back the schema.

This is a special case of a more general problem. When you build the dependency graph of your service and identify all its direct dependencies, you need to plan for the situation where any one of your dependencies is suddenly rolled back by its owners. If your launch is waiting for a dependency service S to move from release r to r+1, you have to be sure that S is going to “stick” at r+1. One approach here is to make an ecosystem assumption that any service could be rolled back by one version, in which case your service would wait for S to reach version r+2 before your service moved to a version depending on a feature in r+1.

Summary

We’ve learned that there’s no good rollout unless you have a corresponding rollback ready to do, but how can we know when to rollback without having our entire service burned to the ground by a bad release?

In part 2 we’ll look at the strategy of “canarying” to detect real production problems without risking the bulk of your production traffic on a new release.

Reimagining support for Google Cloud Platform: new pricing model and partnerships



San Francisco  Cloud customers need a flexible and responsive relationship with their providers. We’ve taken a close look at how we offer support and today are announcing a new model, as well as partnerships with Pivotal and Rackspace, to deliver a closer and more effective way to engage with customers.

When you move to the cloud you’re choosing to bet your business on that platform. The way we see it, this relationship is more than just a typical ‘customer/vendor’ dynamic; it’s a partnership, especially when it comes to support. We started down this path last year when we launched Customer Reliability Engineering (CRE) based on the principles of Site Reliability Engineering (SRE), but we always knew that it was just the first step.

Engineering Support: new role-based model


Support is part of the overall product experience, and we think it should be tailored to your unique technical needs. We’re announcing the all new Engineering Support, a role-based subscription model that allows us to match engineer to engineer, so we can meet you where your business is, no matter what stage of development you’re in.

We know you don’t have just one project or team. We know you're constantly shifting engineers around as software development projects move from concept to development to production. With this in mind, we took a look at support and asked: Why should you be forced to pick just one support plan for your whole company? Why should you be locked into paying for a multi-year support contract? Why should you pay more for support as you spend more on the platform? If we’re doing our jobs, then you should spend less over time.

In rethinking our approach, and after listening to the feedback from customers, we focused on three principles:

  • Predictability - Flat fees per user per month; no variable percentage-of-platform-spend charges; You should know on day one what your costs will be on day 30.
  • Customizability - You should be able to configure your support entitlements to match the exact needs of your business.
  • Flexibility - You should be able to change the level of support from month-to-month as the unique needs of your business evolve.

With Engineering Support, we’ll offer three choices per support seat:

  • Development engineering support is ideal for developers or QA engineers that can manage with a response within four to eight business hours, priced at $100/user per month.
  • Production engineering support provides a one-hour response time for critical issues at $250/user per month.
  • On-call engineering support pages a Google engineer and delivers a 15-minute response time 24x7 for critical issues at $1,500/user per month.

With this new model, you pay for only the roles your team needs and can decide what time-frame of support responses best suits the lifecycle stages of your applications and who in your organization needs to interact with support.

The advantages of the Engineering Support model are:
  • You can mix and match your support levels and spend to the stages of development maturity for your projects. You can add, remove or change support levels monthly, from our Cloud Console. No more buying the highest tier for the whole company just because one project needs a 15-minute response time.
  • Prices are fixed so you know on the first day of the month what your support bill will be on the last day of the month. No more of the dreaded “success tax” where your support bill increases with cloud usage.
  • You can make adjustments month-to-month as your needs evolve, changing your support needs with shifts in your business.

Our goal is for Engineering Support to eventually replace the “precious metal” (e.g., silver, gold) tiers that link support costs directly to cloud usage. We're aiming to roll out Engineering Support with new customers this spring. We’ll also work with our existing customers to move them to the new model over the course of the year.

And, there's an option to not pay for support at all. As always, we support every customer for quota increase requests and billing questions at no cost. Further, forums, communities, issue lists and product documentation are accessible by all, 24x7.

Watch for more info soon at cloud.google.com/support.

Pivotal Cloud Foundry first CRE partner

We support our customers’ needs to use the mix of cloud services that is best for their business. And since we launched the CRE program, we’ve known that it’s important to have an ecosystem of technology partners to extend this mix. So we’re excited to announce that we’ve brought on Pivotal as our first CRE technology partner.

CRE technology partners will work hand-in-hand with Google to thoroughly review their solutions and implement changes to address identified risks to reliability. Additionally, we'll continuously work with our CRE partners to ensure that the partner solution’s configuration, architecture and conventions allow users to create highly reliable and available applications that are in line with CRE’s best practices.

We’re working closely with Pivotal to make sure that GCP customers who choose to use Pivotal Cloud Foundry can feel comfortable that they’re going to build and deploy highly reliable systems by default.

Rackspace Support for GCP

To ensure our customers are covered beyond our own support teams, we’re partnering with Rackspace to offer managed support for GCP. The Rackspace team is actively building their GCP practice, and we are providing them with the resources and tools they need to deliver the best experience to our joint customers.

Our goal is to make GCP the best choice for Rackspace customers looking to move to cloud by creating a high level of integration and experience. Rackspace already has Google Certified Professionals on staff, and expects to begin onboarding beta customers for GCP managed support in the coming months.

Visit us at Next

If you're at Google Cloud Next '17 this week, please stop by The Fifth Nine lounge to learn more about SRE and DevOps. We’re hosting some great sessions including SRE War Stories, where panelists will discuss memorable SRE events in Google history, talks by customers and partners, and a fireside chat with SRE creators Benjamin Treynor Sloss and Ben Lutch. See you there!

SLOs, SLIs, SLAs, oh my – CRE life lessons



Last week on CRE life lessons, we discussed how to come up with a precise numerical target for system availability. We term this target the Service Level Objective (SLO) of our system. Any discussion we have in future about whether the system is running sufficiently reliably and what design or architectural changes we should make to it must be framed in terms of our system continuing to meet this SLO.

We also have a direct measurement of SLO conformance: the frequency of successful probes of our system. This is a Service Level Indicator (SLI). When we evaluate whether our system has been running within SLO for the past week, we look at the SLI to get the service availability percentage. If it goes below the specified SLO, we have a problem and may need to make the system more available in some way, such as running a second instance of the service in a different city and load balancing between the two.

Why have an SLO at all?

Suppose that we decide that running our aforementioned Shakespeare service against a formally defined SLO is too rigid for our tastes; we decide to throw the SLO out of the window and make the service “as available as is reasonable.” This makes things easier, no? You simply don’t mind if the system goes down for an hour now and then. Indeed, perhaps downtime is normal during a new release and the attending stop-and-restart.

Unfortunately for you, customers don’t know that. All they see is that Shakespeare searches that were previously succeeding have suddenly started to return errors. They raise a high-priority ticket with support, who confirms that they see the error rate and escalate to you. Your on-call engineer investigates, confirms this is a known issue, and responds to the customer with “this happens now and again, you don’t have to escalate.” Without an SLO, your team has no principled way of saying what level of downtime is acceptable; there's no way to measure whether or not this a significant issue with the service. and you cannot terminate the escalation early with “Shakespeare search service is currently operating within SLO.” As our colleague Perry Lorier likes to say, “if you have no SLOs, toil is your job.”

The SLO you run at becomes the SLO everyone expects


A common pattern is to start your system off at a low SLO, because that’s easy to meet: you don’t want to run a 24/7 rotation, your initial customers are OK with a few hours of downtime, so you target at least 99% availability  1.68 hours downtime per week. But in fact, your system is fairly resilient and for six months operates at 99.99% availability  down for only a few minutes per month.

But then one week, something breaks in your system and it’s down for a few hours. All hell breaks loose. Customers page your on-call complaining that your system has been returning 500s for hours. These pages go unnoticed, because on-call leaves their pagers on their desks overnight, per your SLO which only specifies support during office hours.

The problem is, customers have become accustomed to your service being always available. They’ve started to build it into their business systems on the assumption that it’s always available. When it’s been continually available for six months, and then goes down for a few hours, something is clearly seriously wrong. Your excessive availability has become a problem because now it’s the expectation. Thus the expression, “An SLO is a target from above and below”  don’t make your system very reliable if you don’t intend and commit to it to being that reliable.

Within Google, we implement periodic downtime in some services to prevent a service from being overly available. In the SRE Book, our colleague Marc Alvidrez tells a story about our internal lock system  Chubby. Then, there’s the set of test front-end servers for internal services to use in testing, allowing those services to be accessible externally. These front-end servers are convenient but are explicitly not intended for use by real services; they have a one business day support SLA, and so can be down for 48 hours before the support team is even obligated to think about fixing them. Over time, experimental services that used those front-ends started to become critical; when we finally had a few hours of downtime on the front-ends, it caused widespread consternation.

Now we run a quarterly planned-downtime exercise with these front-ends. The front-end owners send out a warning, then block all services on the front-ends except for a small whitelist. They keep this up for several hours, or until a major problem with the blockage appears; the blockage can be quickly reversed in that case. At the end of the exercise the front-end owners receive a list of services that use the front-ends inappropriately, and work with the service owners to move them to somewhere more suitable. This downtime exercise keeps the front-end availability suitably low, and detects inappropriate dependencies in time to get them fixed.

Your SLA is not your SLO


At Google, we distinguish between a Service-Level Agreement (SLA) and a Service-Level Objective (SLO). An SLA normally involves a promise to someone using your service that its availability should meet a certain level over a certain period, and if it fails to do so then some kind of penalty will be paid. This might be a partial refund of the service subscription fee paid by customers for that period, or additional subscription time added for free. The concept is that going out of SLA is going to hurt the service team, so they'll push hard to keep it within SLA.

Because of this, and because the principle availability shouldn’t be much better than the SLO, the SLA is normally a looser objective than the SLO. This might be expressed in availability numbers: for instance, an availability SLA of 99.9% over 1 month with an internal availability SLO of 99.95%. Alternatively the SLA might only specify a subset of the metrics comprising the SLO.

For example, with our Shakespeare search service, we might decide to provide it as an API to paying customers in which a customer pays us $10K per month for the right to send up to one million searches per day. Now that money is involved, we need to specify in the contract how available they can expect the service to be, and what happens if we breach that agreement. We might say that we'll provide the service at a minimum of 99% availability, following the definition of successful queries given previously. If the service drops below 99% availability in a month, then we'll refund $2K; if it drops below 80% then, we'll refund $5K.

If you have an SLA that's different from your SLO, as it almost always is, it’s important for your monitoring to measure SLA compliance explicitly. You want to be able to view your system’s availability over the SLA calendar period, and easily see if it appears to be in danger of going out of SLA. You'll also need a precise measurement of compliance, usually from logs analysis. Since we have an extra set of obligations (in the form of our SLA) to paying customers, we need to measure queries received from them separately from other queries (we might not mind dropping queries from non-paying users if we have to start load shedding, but we really care about any query from the paying customer that we fail to handle properly). That’s another benefit of establishing an SLA  it’s an unambiguous way to prioritize traffic.

When you define your SLA, you need to be extra-careful about which queries you count as legitimate. For example, suppose that you give each of three major customers (whose traffic dominates your service) a quota of one million queries per day. One of your customers releases a buggy version of their mobile client, and issues two million queries per day for two days before they revert the change. Over a 30-day period you’ve issued approximately 90 million good responses, and two million errors; that gives you a 97.8% success rate. You probably don’t want to give all your customers a refund as a result of this; two customers had all their queries succeed, and the customer for whom two million out of 32 million queries were rejected brought this upon themselves. So perhaps you should exclude all “out of quota” response codes from your SLA accounting.

On the other hand, suppose you accidentally push an empty quota specification file to your service before going home for the evening. All customers receive a default 1000 queries per day quota. Your three top customers get served constant “out of quota” errors for 12 hours until you notice the problem when you come into work in the morning, and revert the change. You’re now showing 1.5 million rejected queries out of 90 million for the month, a 98.3% success rate. This is all your fault: counting this as 100% success for 88.5M queries is missing the point and a moral failure for measuring the SLA.

Conclusion


SLIs, SLOs and SLAs aren’t just useful abstractions. Without them you cannot know if your system is reliable, available, or even useful. If they don’t tie explicitly back to your business objectives then you have no idea if the choices you make are helping or hurting your business. You also can’t make honest promises to your customers.

If you’re building a system from scratch, make sure that SLIs, SLOs and SLAs are part of your system requirements. If you already have a production system but don’t have them clearly defined then that’s your highest priority work.

To summarize:
  • If you want to have a reliable service, you must first define “reliability.” In most cases that actually translates to availability.
  • If you want to know how reliable your service is, you must be able to measure the rates of successful and unsuccessful queries; these will form the basis of your SLIs.
  • The more reliable the service, the more it costs to operate. Define the lowest level of reliability that you can get away with, and state that as your Service Level Objective (SLO).
  • Without an SLO, your team and your stakeholders cannot make principled judgements about whether your service needs to be made more reliable (increasing cost and slowing development) or less reliable (allowing greater velocity of development).
  • If you’re charging your customers money you'll probably need an SLA, and it should be a little bit looser than your SLO.

As an SRE (or DevOps professional), it's your responsibility to understand how your systems serve the business in meeting those objectives, and, as much as possible, control for risks that threaten the high-level objective. Any measure of system availability that ignores business objectives is worse than worthless because it obfuscates the actual availability, leading to all sorts of dangerous scenarios, false senses of security and failure.

For those of you who wrote us thoughtful comments and questions from our last article, we hope this post has been helpful. Keep the feedback coming!

N. B. Google Cloud Next '17 is fewer than seven weeks away. Register now to join Google Cloud SVP Diane Greene, Google CEO Sundar Pichai, and other luminaries for three days of keynotes, code labs, certification programs, and over 200 technical sessions. And for the first time ever, Next '17 will have a dedicated space for attendees to interact with Google experts in Site Reliability Engineering and Developer Operations.

Available . . . or not? That is the question – CRE life lessons



In our last installment of the CRE life lessons series, we discussed how to survive a "success disaster" with load-shedding techniques. We got a lot of great feedback from that post, including several questions about how to tie measurements to business objectives. So, in this post, we decided to go back to first principles, and investigate what “success” means in the first place, and how to know if your system is “succeeding” at all.

A prerequisite to success is availability. A system that's unavailable cannot perform its function and will fail by default. But what is "availability"? We must define our terms:

Availability defines whether a system is able to fulfill its intended function at a point in time. In addition to being used as a reporting tool, the historical availability measurement can also describe the probability that your system will perform as expected in the future. Sometimes availability is measured by using a count of requests rather than time directly. In either case, the structure of the formula is the same: successful units / total units. For example, you might measure uptime / (uptime + downtime), or successful requests / (successful requests + failed requests). Regardless of the particular unit used, the result is a percentage like 99.9% or 99.999%  sometimes referred to as “three nines” or “five nines.”

Achieving high availability is best approached by focusing on the unsuccessful component (e.g., downtime or failed requests). Taking a time-based availability metric as an example: given a fixed period of time (e.g., 30 days, 43200 minutes) and an availability target of 99.9% (three nines), simple arithmetic shows that the system must not be down for more than 43.2 minutes over the 30 days. This 43.2 minute figure provides a very concrete target to plan around, and is often referred to as the error budget. If you exceed 43.2 minutes of downtime over 30 days, you'll not meet your availability goal.

Two further concepts are often used to help understand and plan the error budget:

Mean Time Between Failures (MTBF): total uptime / # of failures. This is the average time between failures.

Mean Time to Repair (MTTR): total downtime / # of failures. This is the average time taken to recover from a failure.

These metrics can be computed historically (e.g., over the past 3 months, or year) and combined as (Total Period / MTBF) * MTTR to give an expected downtime value. Continuing with the above example, if the historical MTBF is calculated to be 10 days, and the historical MTTR is calculated to be 20 minutes, then you would expect to see 60 minutes of downtime ((30 days / 10 days) * 20 minutes)  clearly outside the 44-minute error budget for a three-nines availability target. To meet the target would require decreasing the MTBF (say to every 20 days) or decreasing the MTTR (say to 10 minutes), or a combination of both.

Keeping the concepts of error budget, MTBF and MTTR in mind when defining an availability target helps to provide justification for why the target is set where it is. Rather than simply describing the target as a fixed number of nines, it's possible to relate the numeric target to the user experience in terms of total allowable downtime, frequency and duration of failure.

Next, we'll look at how to ensure this focus on user experience is maintained when measuring availability.


Measuring availability


How do you know whether a system is available? Consider a fictitious "Shakespeare" service, which allows users to find mentions of a particular word or phrase in Shakespeare’s texts. This is a canonical example, used frequently within Google for training purposes, and mentioned throughout the SRE book.

Let's try working the scientific method to determine the availability of the hypothetical Shakespeare system.
  1. Question: how often is the system available?
  2. Observation: when you visit shakespeare.com, you normally get back the "200 OK" status code and an HTML blob. Very rarely, you see a 500 Internal Server error or a connection failure.
  3. Hypothesis: if "availability" is the percentage of requests per day that return 200 OK, the system will be 99.9% available.
  4. Measure: "tail" the response logs of the Shakespeare service’s web servers and dump them into a logs-processing system.
  5. Analyze: take a daily availability measurement as the percentage of 200 OK responses vs. the total number of requests.
  6. Interpret: After seven days, there’s a minimum of 99.7% availability on any given day.

Happily, you report these availability numbers to your boss (Dave), and go home. A job well done.

The next day Dave draws your attention to the support forum. Users are complaining that all their searches at shakespeare.com return no results. Dave asks why the availability dashboard shows 99.7% availability for the last day, when there clearly is a problem.

You check the logs and notice that the web server has received just 1000 requests in the last 24 hours, and they're all 200 OKs except for three 500s. Given that you expect at least 100 queries per second, that explains why users are complaining in the forums, although the dashboard looks fine.

You've made the classic mistake of basing your definition of availability on a measurement that does not match user-expectations or business objectives.


Redefining availability in terms of the user experience with black-box monitoring


After fixing the critical issue (a typo in a configuration file) that prevented the Shakespeare frontend service from reaching the backend, we take a step back to think about what it means for our system to be available.

If the "rate of 200 OK logs for shakespeare.com" is not an appropriate availability measurement, then how should we measure availability?

Dave wants to understand the availability as observed by users. When does the user feel that shakespeare.com is available? After some lively back-and-forth, we agree that the system is available when a user can visit shakespeare.com, enter a query and get a result for that query within five seconds, 100% of the time.

So you write a black-box "prober" (black-box, because it makes no assumptions about the implementation of the Shakespeare service, see the SRE Book, Chapter 6) to emulate a full range of clients devices (mobile, desktop). For each type of client, you visit shakespeare.com, enter the query "to be or not to be," and check that the result contains the expected link to Hamlet. You run the prober for a week, and finally recalculate the minimum daily availability measure: 80% of queries return Hamlet within five seconds, 18% of queries take longer, 1% timeout and another 1% return errors. A full 20% of queries fail our definition of availability!


Choosing an availability target according to business goals


After getting over his shock, Dave asks a simple question: “Why can't we have 100% returning within 5 seconds?”

You explain all the usual reasons why: power outages, fiber cuts, etc. After an hour or so, Dave is willing to admit that 100% query response in under five seconds is truly impossible.

Which leads, Dave to ask, “What availability can we have, then?”

You turn the question the question around on him: “What availability is required for us to meet our business goals?”

Dave's eyes light up. The business has set a revenue target of $25 million per year, and we make on average $0.01 per query result. At 100 queries per second * 3,1536,000 seconds per year * 80% success rate * $0.01 per query, we'll earn $25.23 million. In other words, even with a 20% failure rate, we'll still hit our revenue targets!

Still, a 20% failure rate is pretty ugly. Even if we think we'll meet our revenue targets, it's not a good user experience and we might have some attrition as a result. Should we fix it, and if so, what should our availability objective be?

Evaluating cost/benefit tradeoffs, opportunity costs


Suppose the rate of queries returning in greater than five seconds can be reduced to 0.5% if an engineer works on the problem for six months. How should we decide whether or not to do this?

We can start by estimating how much the 20% failure rate is going to cost us in missed revenue (accounting for users who give up on retrying) over the life of the product. We know roughly how much it will cost to fix the problem. Naively, we may decide that since the revenue lost due to the error rate exceeds the cost of fixing the issue, then we should fix it.

But this ignores a crucial factor… the opportunity cost of fixing the problem. What other things could an engineer have done with that time instead?

Hypothetically, there’s a new search algorithm that increases the relevance of Shakespeare search results, and putting it into production might drive a 20% increase in search traffic, even as availability remains constant. This increase in traffic could easily offset any lost revenue due to poor availability.

An oft-heard SRE saying is that you should “design a system to be as available as is required, but not much more.” At Google, when designing a system, we generally target a given availability figure (e.g., 99.9%), rather than particular MTBF or MTTR figures. Once we’ve achieved that availability metric, we optimize our operations for "fast fix," e.g., MTTR over MTBF, accepting that failure is inevitable, and that “spes consilium non est” (Hope is not a strategy). SREs are often able to mitigate the user visible impact of huge problems in minutes, allowing our engineering teams to achieve high development velocity, while simultaneously earning Google a reputation for great availability.

Ultimately, the tradeoff made between availability and development velocity belong to the business. Precisely defining the availability in product terms allows us to have a principled discussion and to make choices we can be proud of.

N.B. Google Cloud Next '17 is fewer than seven weeks away. Register now to join Google Cloud SVP Diane Greene, Google CEO Sundar Pichai and other luminaries for three days of keynotes, code labs, certification programs and over 200 technical sessions. And for the first time ever, Next '17 will have a dedicated space for attendees to interact with Google experts in Site Reliability Engineering and Developer Operations.