
An example escalation policy — CRE life lessons



In an earlier blog post, we discussed the spectrum of engineering effort between reliability and feature development and the importance of describing when and how an organization should dedicate engineering time towards the reliability of a service that is out of SLO. In this post, we show a lightly-edited SLO escalation policy and associated rationales from a Google SRE team to illustrate the trade-offs that particular teams make to maintain a high development velocity.

This SRE team works with large teams of developers focused on different areas of the serving stack, which comprises around ten high-traffic services and a dozen or so smaller ones, all with SRE support. The team has shards in Europe and America, each covering 12 hours of a follow-the-sun on-call rotation. The supported services have both coarse top-level SLOs representing desired user experience and finer-grained SLOs representing the availability requirements of stack components; crucially the SRE team can route pages to dev teams at the granularity of an individual SLO, making "revoking support" for an SLO both cheap and quick. Alerting is configured to page when the service has burned nine hours of error budget within an hour, and file a ticket when it has burned one week of error budget over the previous week.
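
As a concrete illustration of those two alerting conditions, here is a minimal sketch of the burn-rate arithmetic in Python. The window sizes and multipliers mirror the numbers above; the SLO target, the function names and the way error ratios are obtained are assumptions for the sake of the example, not the team's actual configuration.

```python
"""Illustrative burn-rate alerting check, not an actual Google alerting config.

Assumes an SLO expressed as an allowed error ratio and roughly constant
traffic, so that "nine hours of error budget burned in an hour" is the same
as burning budget at nine times the sustainable rate. The error ratios would
come from your monitoring system; here they are plain function arguments.
"""

SLO_TARGET = 0.999                    # assumed 99.9% availability target
ALLOWED_ERROR_RATIO = 1 - SLO_TARGET  # error budget as a ratio of requests


def burn_rate(observed_error_ratio: float) -> float:
    """How many hours of error budget are burned per hour at this error ratio."""
    return observed_error_ratio / ALLOWED_ERROR_RATIO


def alerts(error_ratio_last_hour: float, error_ratio_last_week: float) -> list[str]:
    fired = []
    # Page: nine hours of error budget burned within an hour.
    if burn_rate(error_ratio_last_hour) >= 9.0:
        fired.append("PAGE: fast burn, a week of budget lasts less than a day at this rate")
    # Ticket: one week of error budget burned over the previous week.
    if burn_rate(error_ratio_last_week) >= 1.0:
        fired.append("TICKET: slow burn, the weekly budget is gone over the last week")
    return fired


if __name__ == "__main__":
    # Example: 1.2% errors over the last hour, 0.15% over the last week.
    print(alerts(error_ratio_last_hour=0.012, error_ratio_last_week=0.0015))
```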

It's important to note that this policy is just an example, and probably a poor one if your SRE team supports a service with availability targets of 99.99% or higher. The industry that this Google team operates in is highly competitive and moves quickly, making feature iteration speed and time-to-market more important than maintaining high levels of availability.

Escalation policy preamble


Before getting into the specifics of the escalation policy, it's important to consider the following broad points.

The intent of an escalation policy is not to be completely prescriptive; SREs are expected to make judgement calls as to appropriate responses to situations they face. Instead, this document establishes reasonable thresholds for specific actions to take place, with the intent of reducing the likely range of responses and achieving a measure of consistency. It's structured as a series of thresholds that, when crossed, trigger the redirection of more engineering effort towards addressing an SLO violation.

Furthermore, SRE must focus on fixing the class of issue before declaring an incident resolved. This is a higher bar than fixing the issue itself. For example, if a bad flag flip causes a severe outage, reverting the flag flip is insufficient to bring the service back into SLO. SRE must instead ensure that flag flips in general are extremely unlikely to threaten the SLO in the future, with staged rollouts, automated rollbacks on push failures, and versioned configuration to tie flags to binary versions.

For the following four thresholds in the escalation policy, "bringing a service back into SLO" means:
  • finding the root cause and fixing the relevant class of issue, or
  • automating remediation such that ongoing manual intervention is no longer necessary, or 
  • simply waiting one week, if the class of issue is extremely unlikely to recur with frequency and severity sufficient to threaten the SLO in the future
In other words, a plan for manual remediation is not sufficient to consider the service back within SLO. Bear in mind that you usually need to understand the root cause of a violation to conclude that it's unlikely to recur or to automate remediation.


Escalation policy thresholds


Threshold 1 - wherein SRE are notified that an SLO is potentially impacted

SRE will maintain alerting so as to be notified of danger to supported SLOs. Upon being notified, SRE will investigate and attempt to find and address the root cause. SRE will consider taking mitigating actions, including redirecting traffic at the load balancers and rolling back binary or configuration pushes. SRE on-call engineers will notify the dev team about the SLO impact and keep them updated as necessary, but no action on their part is required at this point.

Threshold 2 - wherein SRE escalates to the developers
  • If,
    • SRE have concluded they cannot bring the service into SLO without help, and
    • SRE and dev agree that the SLO represents desired user experience
  • Then,
    • SRE and dev on-calls prioritize fixing the root cause and update the bug daily
    • SRE escalates to dev leads for visibility and additional assistance if necessary
    • Alerting thresholds may be relaxed to avoid continually paging for the known issue, while continuing to provide protection against further regressions
  • When the service is brought back into SLO,
    • SRE will revert any alerting changes
    • SRE may create a postmortem
    • Or, if the SLO does not accurately represent desired user experience, the SRE, dev and product teams will agree to change or retire the SLO
Threshold 3 - wherein SRE pauses feature releases and focuses on reliability
  • If,
    • Conditions for the previous threshold are met for at least one week, and
    • The service has not been brought back into SLO, and
    • The 30-day error budget is exhausted
  • Then during the following week,
    • Only cherry-picked fixes for diagnosed root causes may be pushed to production
    • SRE may escalate to their leadership and dev management to request that members of the dev team prioritize finding and fixing the root cause over any non-emergency work
    • Daily updates may be made to an "escalations" mailing list (used to broadcast information about outages to a wide audience, including executive leadership).
  • When the service is brought back into SLO,
    • Normal binary releases resume
    • SRE creates a postmortem
    • Team members may re-prioritize normal project work

Threshold 4 - wherein SRE may escalate or revoke support

  • If,
    • Conditions for the previous threshold are met for at least one week, and
    • The service has not been brought back into SLO, and
    • The 90-day error budget is exhausted or the dev team is unwilling to pause feature work to improve reliability
  • Then,
    • SRE may escalate to executive leadership to commandeer more people dedicated to fixing the problem
    • SRE may revoke support for the SLO or the service, and re-direct or disable relevant alerting

On escalation and incident response


SREs are first responders, and there's an expectation that they'll make a reasonable effort to bring the service back within SLO before escalating to developers. As such, threshold 1 applies when the SRE team is notified about a violation, despite the one-week ticket alert indicating the seven-day budget is already exhausted. SRE should wait no longer than one week from the initial violation notification before escalating to developers, but they may exercise their own judgement as to whether escalation is appropriate before this point.

Every time SRE escalates, it’s important to ask developers whether the availability goals still represent the desired balance between reliability and development velocity. This gives them a choice: preserve the availability goals by rolling back a new feature, or temporarily relax them to keep that feature available to users, if the latter is the desired user experience. For repeated violations of the same SLO in a short time window you probably don't need to ask the question again each time, though such repetition is a strong signal that further escalation is necessary. It's also OK to insist that developers take back the pager for the service until they're willing to restore the previously agreed availability targets: if they want to run a less reliable service temporarily so that a business-critical feature remains available while they work on its reliability, they can also shoulder the burden of its failures.

On blocking releases


Blocking releases is an appropriate course of action for three main reasons:
  1. Commonly, the largest source of burnt error budget at steady state is the release push. If you’ve already burned all your budget, not pushing new releases lowers the steady-state burn rate, bringing the service back into SLO more quickly
  2. It eliminates the risk of further unexpected SLO violations due to bugs in new code. This is also why any fixes for diagnosed root causes must be patched into the current release, rather than rolling forward to a new release
  3. While blocking releases is not intended as a punitive measure, it does directly impact release velocity, which the dev org cares about deeply. As such, tying SLO violations to reduced velocity aligns the incentives of both organizations. SRE wants the service to stay within SLO, the dev org wants to build new features quickly. This way, either both happen or neither do.
SRE should prefer to unblock feature releases sooner rather than later, once the root cause(s) of a violation have been found and fixed. Giving dev teams the benefit of the doubt, and assuming there will be no further service degradation before the SLO is back in compliance over a 30-day window, strikes a more acceptable balance between reliability and velocity. This is effectively "borrowing" future error budget to unblock releases before the service is compliant, with the expectation that compliance will return within a reasonable timeframe. Absent any push-related outages, new features should increase user happiness with the service, repaying some of the unhappiness caused by the SLO violation.

SRE may choose not to unblock releases if pre-violation error-budget burn rates were close to the SLO threshold. In this case, there's less future budget to borrow, thus the risk of further violations is higher and the time until the service is SLO compliant will be significantly longer if releases are allowed to continue.
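
To make the idea of "borrowing" concrete, here is a rough, hedged sketch of the arithmetic; the daily burn figures are invented. The headroom it computes is the share of the 30-day budget that steady-state operation is expected to leave untouched, which is the budget available to absorb a bad release before the service is formally back in compliance.

```python
"""Back-of-the-envelope arithmetic for borrowing future error budget.

Illustrative only: it assumes a 30-day rolling window and a constant
steady-state daily burn. The numbers below are invented.
"""

def steady_state_headroom(daily_burn_fraction: float) -> float:
    """Fraction of the 30-day error budget left over at steady state."""
    return 1.0 - daily_burn_fraction * 30


# A service whose steady-state burn is ~50% of its budget per window:
print(f"healthy service:  {steady_state_headroom(0.5 / 30):.0%} of the budget available to borrow")
# A service that was already close to the SLO threshold before the outage:
print(f"marginal service: {steady_state_headroom(0.97 / 30):.0%} of the budget available to borrow")
```

In the second case there is almost nothing left to borrow, which is exactly when keeping releases blocked is the safer call.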

Summary


We hope that the above example gives you some ideas about how to make trade-offs between reliability and development velocity for a service where the latter is a key business priority. The main concessions to velocity are that SRE doesn’t immediately block releases when an SLO is violated, and that there is a mechanism for releases to resume, with SRE's informed consent, before the SLO has returned to compliance. In the final post of the series, we'll take these policy thresholds out for a spin with some hypothetical scenarios.

Consequences of SLO violations — CRE life lessons



Previous episodes of CRE life lessons have talked in detail about the importance of quantifying a service's availability and using SLOs to manage the competing priorities of features-focused development teams ("devs") versus a reliability-focused SRE team. Good SLOs can help reduce organizational friction and maintain development velocity without sacrificing reliability. But what should happen when SLOs are violated?

In this blogpost, we discuss why you should create a policy on how SREs and devs respond to SLO violations, and provide some ideas for the structure and components of that policy. Future posts will go over an example taken from an SRE team here at Google, and work through some scenarios that put that policy into action.

Features or reliability?


In the ideal world (assuming spherical SREs in a vacuum), an SLO represents the dividing line between two binary states: developing new features when there's error budget to spare, and improving service reliability when there isn't. Most real engineering organizations will instead vary their effort on a spectrum between these two extremes as business priorities dictate. Even when a service is operating well within its SLOs, choosing to do some proactive reliability work may reduce the risk of future outages, improve efficiency and provide cost savings; conversely it's rare to find an organization that completely drops all in-flight feature development as soon as an SLO is violated.

Describing key inflection points from that spectrum in a policy document is an important part of the relationship between an SRE team and the dev teams with whom they partner. This ensures that all parts of the organization have roughly the same understanding around what is expected of them when responding to (soon to be) violated SLOs, and – most importantly – that the consequences of not responding are clearly communicated to all parties. The exact choice of inflection points and consequences will be specific to the organization and its business priorities.

Inflection points


Having a strong culture of blameless postmortems and fixing root causes should eventually mean that most SLO violations are unique – informally, “we are in the business of novel outages.” It follows that the response to each violation will also be unique; making judgement calls around these is part of an SRE's job when responding to the violation. But a large variance in the range of possible responses results in inconsistency of outcomes, people trying to game the system and uncertainty for the engineering organization.

For the purposes of an escalation policy, we recommend that SLO violations be grouped into a few buckets of increasing severity based on the cumulative impact of the violation over time (i.e., how much error budget has been burned over what time horizon), with clearly defined boundaries for moving from one bucket to another. It's useful to have some business justification for why violations are grouped as they are, but this should be in an appendix to the main policy to keep the policy itself clear.

It's a good idea to tie at least some of the bucket boundaries to any SLO-based alerting you have. For example, you may choose to page SREs to investigate when 10% of the weekly error budget has been burned in the past hour; this is an example of an inflection point tied to a consequence. It forms the boundary between buckets we might informally title "not enough error budget burned to notify anyone immediately" and "someone needs to investigate this right now before the service is out of its long-term SLO." We'll examine more concrete examples in our next post, where we look at a policy from an SRE team within Google.
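
As a sketch of how such buckets might be encoded, the snippet below maps cumulative burn (as a fraction of a 30-day budget) to a severity bucket. The bucket names and boundaries are invented placeholders; a real policy would tie them to its own alerting thresholds and business justification.

```python
def violation_bucket(budget_burned_fraction: float) -> str:
    """Map cumulative error-budget burn to an escalation bucket (illustrative boundaries)."""
    if budget_burned_fraction < 0.10:
        return "bucket 0: log it, nobody needs to be notified immediately"
    if budget_burned_fraction < 0.50:
        return "bucket 1: notify the SRE on-call to investigate"
    if budget_burned_fraction < 1.00:
        return "bucket 2: escalate to the dev team"
    return "bucket 3: budget exhausted, consider slowing or blocking releases"


print(violation_bucket(0.35))   # bucket 1
print(violation_bucket(1.20))   # bucket 3
```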

Consequences


The consequences of a violation are the meat of the policy. They describe actions that will be taken to bring the service back into SLO, whether this is by root causing and fixing the relevant class of issue, automating any stop-gap mitigation tasks or by reducing the near-term risk of further deterioration. Again, the choice of consequence for a given threshold is going to be specific to the organization defining the policy, but there are several broad areas into which these fall. This list is not exhaustive!

Notify someone of potential or actual SLO violation

The most common consequence of any potential or actual SLO violation is that your monitoring system tells a human that they need to investigate and take remedial action. For a mature, SRE-supported service, this will normally be in the form of a page to the on-call when a large quantity of error budget has been burned over a short window, or a ticket when there’s an elevated burn rate over a longer time horizon. It's not a bad idea for that page to also create a ticket in which you can record debugging details, use as a centralized communication point and reference when escalating a serious violation.

The relevant dev team should also be notified. It's OK for this to be a manual process; the SRE team can add value by filtering and aggregating violations and providing meaningful context. But ideally a small group of senior people in the dev team should be made aware of actual violations in an automated fashion (e.g., by CCing them on any tickets), so that they're not surprised by escalations and can chime in if they have pertinent information.

Escalate the violation to the relevant dev team

The key difference between notification and escalation is the expectation of action on the part of the dev team. Many serious SLO violations require close cooperation between SREs and developers to find the root cause and prevent recurrence. Escalation is not an admission of defeat. SREs should escalate as soon as they're reasonably sure that input from the dev team will meaningfully reduce the time to resolution. The policy should set an upper bound on the length of time an SLO violation (or near miss) can persist without escalation.

Escalation does not signify the end of SRE’s involvement with an SLO violation. The policy should describe the responsibilities of each team and a lower bound on the amount of engineering time they should divert towards investigating the violation and fixing the root cause. It will probably be useful to describe multiple levels of escalation, up to and including getting executive-level support to commandeer the engineering time of the entire dev team until the service is reliable.

Mitigate risk of service changes causing further impact to SLOs

Since a service in violation of its SLO is by definition making users unhappy, day-to-day operations that may increase the rate at which error budget is burned should be slowed or stopped completely. Usually, this means restricting the rate of binary releases and experiments, or stopping them completely until the service is again within SLO. This is where the policy needs to ensure all parties (SRE, development, QA/testing, product and execs) are on the same page. For some engineering organizations, the idea that SLO violations will impact their development and release velocity may be difficult to accept. Reaching a documented agreement on how and when releases will be blocked – and what fraction of engineers will be dedicated to reliability work when this occurs – is a key goal.

Revoke support for the service

If a service is shown to be incapable of meeting its agreed-upon SLOs over an extended time period, and the dev team responsible for that service is unwilling to commit to engineering improvements to its reliability, then SRE teams at Google have the option of handing back the responsibility for running that service in production. This is unlikely to be the consequence of a single SLO violation, but rather of multiple serious outages over an extended period of time, where postmortem action items have been assigned to the dev team but not prioritized or completed.

This has worked well at Google, because it changes the incentives behind any conversation around engineering for reliability. Any dev team that neglects the reliability of a service knows that they will bear the consequences of that neglect. By definition, revoking SRE support for a service is a last resort, but stating the conditions that must be met for it to happen makes it a matter of policy, not an idle threat. Why should SRE care about service reliability if the dev team doesn't?

Summary


Hopefully this post has helped you think about the trade-off between engineering for reliability and features, and how responding to SLO violations moves the needle towards reliability. In our next post, we'll present an escalation policy from one of Google's SRE teams, to show the choices they made to help the dev teams they partner with maintain a high development velocity.

What a year! Google Cloud Platform in 2017



The end of the year is a time for reflection . . . and making lists. As 2017 comes to a close, we thought we’d review some of the most memorable Google Cloud Platform (GCP) product announcements, white papers and how-tos, as judged by popularity with our readership.

As we pulled the data for this post, some definite themes emerged about your interests when it comes to GCP:
  1. You love to hear about advanced infrastructure: CPUs, GPUs, TPUs, better network plumbing and more regions. 
  2. How we harden our infrastructure is endlessly interesting to you, as are tips about how to use our security services.
  3. Open source is always a crowd-pleaser, particularly if it presents a cloud-native solution to an age-old problem.
  4. You’re inspired by Google innovation — unique technologies that we developed to address internal, Google-scale problems.

So, without further ado, we present to you the most-read stories of 2017.

Cutting-edge infrastructure

If you subscribe to the “bigger is always better” theory of cloud infrastructure, then you were a happy camper this year. Early in 2017, we announced that GCP would be the first cloud provider to offer the Intel Skylake architecture; GPUs for Compute Engine and Cloud Machine Learning became generally available; and Shazam talked about why cloud GPUs made sense for them. In the spring, you devoured a piece on the performance of TPUs, and another about the then-largest cloud-based compute cluster. We announced yet more new GPU models and, topping it all off, Compute Engine began offering machine types with a whopping 96 vCPUs and 624GB of memory.

It wasn’t just our chip offerings that grabbed your attention — you were pretty jazzed about Google Cloud network infrastructure too. You read deep dives about Espresso, our peering-edge architecture, TCP BBR congestion control and improved Compute Engine latency with Andromeda 2.1. You also dug stories about new networking features: Dedicated Interconnect, Network Service Tiers and GCP’s unique take on sneakernet: Transfer Appliance.

What’s the use of great infrastructure without somewhere to put it? 2017 was also a year of major geographic expansion. We started out the year with six regions, and ended it with 13, adding Northern Virginia, Singapore, Sydney, London, Germany, São Paulo and Mumbai. This was also the year that we shed our Earthly shackles, and expanded to Mars ;)

Security above all


Google has historically gone to great lengths to secure our infrastructure, and this was the year we discussed some of those advanced techniques in our popular Security in plaintext series. Among them: 7 ways we harden our KVM hypervisor, Fuzzing PCI Express and Titan in depth.

You also grooved on new GCP security services: Cloud Key Management and managed SSL certificates for App Engine applications. Finally, you took heart in a white paper on how to implement BeyondCorp as a more secure alternative to VPN, and support for the European GDPR data protection laws across GCP.

Open, hybrid development


When you think about GCP and open source, Kubernetes springs to mind. We open-sourced the container management platform back in 2014, but this year we showed that GCP is an optimal place to run it. It’s consistently among the first cloud services to run the latest version (most recently, Kubernetes 1.8) and comes with advanced management features out of the box. And as of this fall, it’s certified as a conformant Kubernetes distribution, complete with a new name: Google Kubernetes Engine.

Part of Kubernetes’ draw is as a platform-agnostic stepping stone to the cloud. Accordingly, many of you flocked to stories about Kubernetes and containers in hybrid scenarios. Think Pivotal Container Service and Kubernetes’ role in our new partnership with Cisco. The developers among you were smitten with Cloud Container Builder, a stand-alone tool for building container images, regardless of where you deploy them.

But our open source efforts aren’t limited to Kubernetes — we also made significant contributions to Spinnaker 1.0, and helped launch the Istio and Grafeas projects. You ate up our "Partnering on open source" series, featuring the likes of HashiCorp, Chef, Ansible and Puppet. Availability-minded developers loved our Customer Reliability Engineering (CRE) team’s missive on release canaries, and with API design: Choosing between names and identifiers in URLs, our Apigee team showed them a nifty way to have their proverbial cake and eat it too.

Google innovation


In distributed database circles, Google’s Spanner is legendary, so many of you were delighted when we announced Cloud Spanner and a discussion of how it defies the CAP Theorem. Having a scalable database that offers strong consistency and great performance seemed to really change your conception of what’s possible — as did Cloud IoT Core, our platform for connecting and managing “things” at scale. CREs, meanwhile, showed you the Google way to handle an incident.

2017 was also the year machine learning became accessible. For those of you with large datasets, we showed you how to use Cloud Dataprep, Dataflow, and BigQuery to clean up and organize unstructured data. It turns out you don’t need a PhD to learn to use TensorFlow, and for visual learners, we explained how to visualize a variety of neural net architectures with TensorFlow Playground. One Google Developer Advocate even taught his middle-school son TensorFlow and basic linear algebra, as applied to a game of rock-paper-scissors.

Natural language processing also became a mainstay of machine learning-based applications; here, we highlighted it with a lighthearted and relatable example. We launched the Video Intelligence API and showed how Cloud Machine Learning Engine simplifies the process of training a custom object detector. And the makers among you really went for a post that shows you how to add machine learning to your IoT projects with Google AIY Voice Kit. Talk about accessible!

Lastly, we want to thank all our customers, partners and readers for your continued loyalty and support this year, and wish you a peaceful, joyful holiday season. And be sure to rest up and visit us again next year. Because if you thought we had a lot to say in 2017, well, hold onto your hats.


Getting the most out of shared postmortems — CRE life lessons



In our previous post we discussed the benefits of sharing internal postmortems outside your company. You may adopt a one-to-many approach with an incident summary that tells all your customers what happened and how you'll prevent it from happening again. Or, if the incident impacted a major customer, you may share something close to your original postmortem with them.

In this post, we consider how to review a postmortem with your affected customer(s) for better actionable data and also to help customers improve their systems and practices. We also present a worked example of a shared postmortem based on the SRE Book postmortem template.


Postmortems should fix your customer too

How to get outages to benefit everyone

Even if the fault was 100% on you, the platform side, an external postmortem can still help customers improve their reliability. Now that we know what happens when a particular failure occurs, how can we generalize this to help the customer mitigate the impact, and reduce MTTD and MTTR for a similar incident in the future?

One of the best sources of data for any postmortem is your customers’ SLOs, with their ability to measure the impact of a platform outage. Our CRE team talks about SLOs quite a lot in the CRE Life Lessons series, and there’s a reason why: SLOs and error budgets inform more than just whether to release features in your software.

For customers with defined SLOs who suffered a significant error budget impact, we recommend conducting a postmortem review with them. The review is partly to ensure that the customer’s concerns were addressed, but also to identify “what went wrong” and “where we got lucky,” and to agree on actions that would address these for the customer.

For example, suppose the platform’s storage service suffered increased latency for a certain class of objects in a region. This is not the customer’s fault, but they may still be able to do something about it.

The internal postmortem might read something like:

What went well

  • The shared monitoring implemented with CustomerName showed a clear single-region latency hit which resulted in a quick escalation to storage oncall. 
What went wrong 

  • A new release of the storage frontend suffered from a performance regression for uncached reads that was not detected during testing or rollout. 
Where we got lucky 

  • Only reads of objects between 4KB and 32KB in size were materially affected. 
Action items

  • Add explicit read/write latency testing for both cached and uncached objects in buckets of 1KB, 4KB, 32KB, … 
  • Have paging alerts for latency over SLO limits, aggregated by Cloud region, for both cached and uncached objects, in buckets of 1KB, 4KB, 32KB, ... 
When a customer writes their own postmortem about this incident, using the shared postmortem to understand better what broke in the platform and when, that postmortem might look like:

What went well

  • We had anticipated a generic single-region platform failure and had the capability to fail over out of an affected region. 
What went wrong 

  • Although the latency increase was detected quickly, we didn’t have accessible thru-stack monitoring that could show us that it was coming from platform storage-service rather than our own backends. 
  • Our decision to fail out of the affected region took nearly 30 minutes to complete because we had not practiced it for one year and our playbook instructions were out of date. 
Where we got lucky 

  • This happened during business hours so our development team was on hand to help diagnose the cause. 
Action items 

  • Add explicit dashboard monitoring for aggregate read and write latency to and from platform storage-service. 
  • Run periodic (at least once per quarter) test failovers out of a region to validate that the failover instructions still work and increase ops team confidence with the process. 

Prioritize and track your action items 


A postmortem isn’t complete until the root causes have been fixed 

Sharing the current status of your postmortem action items is tricky. It's unlikely that the customer will be using the same issue tracking system as you are, so neither side will have a “live” view of which action items from a postmortem have been resolved, and which are still open. Within Google we have automation which tracks this and “reminds” us of unclosed critical actions from postmortems, but customers can’t see those unless we surface them in the externally-visible part of our issue tracking system, which is not our normal practice.

Currently, we hold a monthly SLO review with each customer, where we list the major incidents and postmortem/incident report for each incident; we use that occasion to report on open critical bug statuses from previous months’ incidents, and check to see how the customer is doing on their actions.

Other benefits 

Opening up is an opportunity 

There are practical reliability benefits of sharing postmortems, but there are other benefits too. Customers who are evolving towards an SRE culture and adopting blameless postmortems can use the external postmortem as a model for their own internal write-ups. We’re the first to admit that it’s really hard to write your own first postmortem from scratch—having a collection of “known-good” postmortems as a reference can be very helpful.

At a higher level, shared postmortems give your customer a “glimpse behind the curtain.” When a customer moves from on-premises hardware to the cloud, it can be frightening; they're giving up a lot of control of and visibility into the platform on which their service runs. The cloud is expected to encapsulate the operational details of the services it offers, but unfortunately it can be guilty of hiding information that the customer really wants to see. A detailed external postmortem makes that information visible, giving the customer a timeline and deeper detail, which hopefully they can relate to.

Joint postmortems

If you want joint operations, you need joint postmortems 

The final step in the path to shared postmortems is creating a joint postmortem. Until this point, we’ve discussed how to externalize an existing document, where the action items, for example, are written by you and assigned to you. With some customers, however, it makes sense to do a joint postmortem where you both contribute to all sections of the document. It will not only reflect your thoughts from the event, but it will also capture the customer’s thoughts and reactions, too. It will even include action items that you assign to your customer, and vice-versa!

Of course, you can’t do joint postmortems with large numbers of your customers, but doing so with at least a few of them helps you (a) build shared SRE culture, and (b) keep the customer perspective in your debugging, design and planning work.

Joint postmortems are also one of the most effective tools you have to persuade your product teams to re-prioritize items on their roadmap, because they present a clear end-user story of how those items can prevent or mitigate future outages.


Summary 


Sharing your postmortems with your customers is not an easy thing to do; however, we have found that it helps:

  • Gain a better understanding of the impact and consequences of your outages
  • Increase the reliability of your customers’ service
  • Give customers confidence in continuing to run on your platform even after an outage.

To get you started, here's an example of an external postmortem for the aforementioned storage frontend outage, using the SRE Book postmortem template. (Note: Text relating to the customer (“JaneCorp”) is marked in purple for clarity.) We hope it sets you on the path to learning and growing from your outages. Happy shared postmortem writing!

Fearless shared postmortems — CRE life lessons



We here on Google’s Site Reliability Engineering (SRE) teams have found that writing a blameless postmortem — a recap and analysis of a service outage — makes systems more reliable, and helps service owners learn from the event.

Postmortems are easy to do within your company — but what about sharing them outside your organization? Why indeed, would you do this in the first place? It turns out that if you're a service or platform provider, sharing postmortems with your customers can be good for you and them too.

In this installment of CRE Life Lessons, we discuss the benefits and complications that external postmortems can bring, and some practical lessons about how to craft them.

Well-known external postmortems 

There is prior art, and you should read it. 

Over the years, we’ve had our share of outages and recently, we’ve been sharing more detail about them than we used to. For example, on April 11, 2016, Google Compute Engine dropped inbound traffic, resulting in this public incident report.

Other companies also publish detailed postmortems about their own outages, and some of those write-ups have become memorably well-known in the industry.

We in Google SRE love reading these postmortems — and not because of schadenfreude. Many of us read them, think “there but for the grace of (Deity) go we,” and wonder whether we would withstand a similar failure. Indeed, when you’re thinking this, it’s a good time to run a DiRT exercise.

For platform providers that offer a wide range of services to a wide range of users, fully public postmortems such as these make sense (even though they're a lot of work to prepare and open you up to criticism from competitors and press). But even if the impact of your outage isn’t as broad, if you are practising SRE, it can still make sense to share postmortems with customers that have been directly impacted. Caring about your customers’ reliability means sharing the details of your outages.

This is the position we take on the Google Cloud Platform (GCP) Customer Reliability Engineering (CRE) team. To help customers run reliably on GCP, we teach them how to engineer increased reliability for their service by implementing SRE best practices in our work together. We identify and quantify architectural and operational risks to each customer’s service, and work with them to mitigate those risks and sustain system reliability at their SLO (Service Level Objective) target.

Specifically, the CRE team works with each customer to help them meet the availability target expressed by their SLOs. For this, the principal steps are to:

  1. Define a comprehensive set of business-relevant SLOs
  2. Get the customer to measure compliance to those SLOs in their monitoring platform (how much of the service error budget has been consumed)
  3. Share that live SLO information with Google support and product SRE teams (which we term shared monitoring)
  4. Jointly monitor and react to SLO breaches with the customer (shared operational fate)
If you run a platform — or some approximation thereof — then you too should practice SRE with your customers to get that increased reliability, prevent your customers from tripping over your changes, and gain better insights into the impact and scope of your failures.

Then, when an incident occurs that causes the service to exceed its error budget — or consumes an unacceptably high proportion of the error budget — the service owner needs to determine:

  1. How much of the error budget did this consume in total?
  2. Why did the incident happen?
  3. What can / should be done to stop it from happening again? 
Answering Question #1 is easy, but the mechanism for evaluating Questions #2 and #3 is a postmortem. If the incident root cause was purely on the customer’s side, that’s easy — but what if the trigger was an event on your platform side? This is when you should consider an external postmortem.
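
Question #1 really is simple arithmetic once you have the SLO and the incident's shape from monitoring. The worked example below uses entirely invented numbers to show the calculation.

```python
"""Worked example: how much error budget did an incident consume?

All figures are invented: a 99.9% monthly availability SLO, roughly 200k
requests per minute, and a 45-minute incident at a 20% error rate.
"""

SLO_TARGET = 0.999
MONTHLY_REQUESTS = 8_640_000_000                      # ~200k requests/minute for 30 days
ERROR_BUDGET = MONTHLY_REQUESTS * (1 - SLO_TARGET)    # allowed failed requests per window

failed_requests = 45 * 200_000 * 0.20                 # 45 minutes at 20% errors

print(f"budget: {ERROR_BUDGET:,.0f} failed requests per 30 days")
print(f"incident consumed {failed_requests / ERROR_BUDGET:.0%} of the budget")  # ~21%
```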

Foundations of an external postmortem


Analyzing outages — and subsequently writing about them in a postmortem — benefits from having a two-way flow of monitoring data between the platform operator and the service owner, which provides an objective measure of the external impact of the incident: When did it start, how long did it last, how severe was it, and what was the total impact on the customer’s error budget? Here on the GCP CRE team, we have found this particularly useful, since it's hard to estimate the impact of problems in lower-level cloud services on end users. We may have observed a 1% error rate and increased latency internally, but was it noticeable externally after traveling through many layers of the stack?

Based on the monitoring data from the service owner and their own monitoring, the platform team can write their postmortem following the standard practices and our postmortem template. This results in an internally reviewed document that has the canonical view of the incident timeline, the scope and magnitude of impact, and a set of prioritized actions to reduce the probability of occurrence of the situation (increased Mean Time Between Failures), reduce the expected impact, improve detection (reduced Mean Time To Detect) and/or recover from the incident more quickly (reduced Mean Time To Recover).

With a shared postmortem, though, this is not the end: we want to expose some — though likely not all — of the postmortem information to the affected customer.

Selecting an audience for your external postmortem


If your customers have defined SLOs, they (and you) know how badly this affected them. Generally, the greater the error budget that has been consumed by the incident, the more interested they are in the details, and the more important it will be to share with them. They're also more likely to be able to give relevant feedback to the postmortem about the scope, timing and impact of the incident, which might not have been apparent immediately after the event.

If your customer’s SLOs weren’t violated but this problem still affected their customers, that’s an action item for the customer’s own postmortem: what changes need to be made to either the SLO or its measurements? For example, was the availability measurement further down in the stack compared to where the actual problem occurred?

If your customer doesn’t have SLOs that represent the end-user experience, it’s difficult to make an objective call about this. Unless there are obvious reasons why the incident disproportionately affected a particular customer, you should probably default to a more generic incident report.

Another factor you should consider is whether the customers with whom you want to share the information are under NDA; if not, this will inevitably severely limit what you're able to share.

If the outage has impacted most of your customers, then you should consider whether the externalized postmortem might be the basis for writing a public postmortem or incident report, like the examples we quoted above. Of course, these are more labor-intensive than external postmortems shared with select customers (i.e., editing the internal postmortem and obtaining internal approvals), but provide additional benefits.

The greatest gain from a fully public postmortem can be to restore trust from your user base. From the point of view of a single user of your platform, it’s easy to feel that their particular problems don’t matter to you. A public postmortem gives them visibility into what happened to their service, why, and how you're trying to prevent it from happening again. It’s also an opportunity for them to conduct their own mini-postmortem based on the information in the public post, asking themselves “If this happened again, how would I detect it and how could I mitigate the effects on my service?”

Deciding how much to share, and why?


Another question when writing external postmortems is how deep to get into the weeds of the outage. At one end of the spectrum you might share your entire internal postmortem with a minimum of redaction; at the other you might write a short incident summary. This is a tricky issue that we’ve debated internally.

The two factors we believe to be most important in determining whether to expose the full detail of a postmortem to a customer, rather than just a summary, are:

  1. How important are the details to understanding how to defend against a future re-occurrence of the event?
  2. How badly did the event damage their service, i.e., how much error budget did it consume? 
As an example, if the customer can see the detailed timeline of the event from the internal postmortem, they may be able to correlate it with signals from their own monitoring and reduce their time-to-detection for future events. Conversely, if the outage only consumed 8% of their 30-day error budget then all the customer wants to know is whether the event is likely to happen more often than once a month.

We have found that, with a combination of automation and practice, we can produce a shareable version of an internal postmortem with about 10% additional work, plus internal review. The downside is that you have to wait for the postmortem to be complete or nearly complete before you start. By contrast, you can write an incident report with a similar amount of effort as soon as the postmortem author is reasonably confident in the root cause.

What to say in a postmortem 


By the time the postmortem is published, the incident has been resolved, and the customer really cares about three questions:

  1. Why did this happen? 
  2. Could it have been worse? 
  3. How can we make sure it won’t happen again?


“Why did this happen?” comes from the “Root causes and Trigger” and “What went wrong” sections of our postmortem template. “Could it have been worse?” comes from “Where we got lucky.”
These are two sections which you should do your best to retain as-is in an external postmortem, though you may need to do some rewording for clarity.

“How can we make sure it won’t happen again” will come from the Action items table of the postmortem.

What not to say

With that said, postmortems should never include these three things:

  1. Names of humans - Rather than "John Smith accidentally kicked over a server," say "a network engineer accidentally kicked over a server." Internally, too, we describe the humans involved in an incident by role rather than by name. This helps us keep a blameless postmortem culture
  2. Names of internal systems - The names of your internal systems are not clarifying for your users, and they create a burden on them to discover how these things fit together. For example, even though we’ve discussed Chubby externally, we still refer to it in postmortems we make external as “our globally distributed lock system.”
  3. Customer-specific information - The internal version of your postmortem will likely say things like “on XX:XX, Acme Corp filed a Support ticket alerting us to a problem.” It’s not your place to share this kind of detail externally, as it may create an undue burden for the reporting company (in this case Acme Corp). Rather, simply say “on XX:XX, a customer filed…”. If you’re going to reference more than one customer, then just label them Customer A, Customer B, etc.

Other things to watch out for

Another source of difficulty when rewriting a shared postmortem is the “Background” section that sets the scene for the incident. An internal postmortem assumes the reader has basic knowledge of the technical and operational background; this is unlikely to be true for your customer. We try to write the least detailed explanation that still allows the reader to understand why the incident happened; too much detail here is more likely to be off-putting than helpful.

Google SREs are fans of embedding monitoring graphs in postmortems; monitoring data is objective and doesn’t generally lie to you (although our colleague Sebastian Kirsch has some very useful guidance as to when this is not true). When you share a postmortem outside the company, however, be careful what information these graphs reveal about traffic levels and the number of users of a service. Our rule of thumb is to leave the X axis (time) alone, but for the Y axis either remove the labels and quantities altogether, or only show percentages. This is equally true for incorporating customer-generated data in an internal postmortem.
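
A minimal sketch of that Y-axis rule of thumb: rescale the series to a percentage of its own peak before it goes into an externally shared graph, so the shape and timing of the incident survive but absolute traffic levels do not. The data layout here is invented.

```python
def to_percent_of_peak(series):
    """Rescale (timestamp, value) samples to percentages of the series peak."""
    peak = max(value for _, value in series)
    return [(ts, 100.0 * value / peak) for ts, value in series]


# Invented internal samples: requests per second around the incident.
internal = [("09:00", 41_000), ("09:10", 40_500), ("09:20", 12_300), ("09:30", 39_800)]
print(to_percent_of_peak(internal))
# [('09:00', 100.0), ('09:10', 98.8...), ('09:20', 30.0), ('09:30', 97.1...)]
```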

A side note on the role of luck


With apologies to Tina Turner, What’s luck got to do, got to do with it? What’s luck but a source of future failures?

As well as “What went well” and “What went badly,” our internal postmortem template includes the section “Where we got lucky.” This is a useful place to tease out risks of future failures that were revealed by an incident. In many cases an incident had less impact than it might have had, because of relatively random factors such as timing, the presence of a particular person as the on-call, or coincidence with another outage that resulted in more active scrutiny of the production systems than normal.

“Where we got lucky” is an opportunity to identify additional action items for the postmortem, e.g.,

  • “the right person was on-call” implies tribal knowledge that needs to be fed into a playbook and exercised in a DiRT test
  • “this other thing (e.g., a batch process or user action) wasn’t happening at the same time” implies that your system may not have sufficient surplus capacity to handle a peak load, and you should consider adding resources
  • “the incident happened during business hours” implies a need for automated alerting and 24-hour pager coverage by an on-call
  • “we were already watching monitoring” implies a need to tune alerting rules to pick up the leading edge of a similar incident if it isn’t being actively inspected.

Sometimes teams also add “Where we got unlucky,” when the incident impact was aggravated by a set of circumstances that are unlikely to re-occur. Some examples of unlucky behavior are:

  • an outage occurred on your busiest day of the year
  • you had a fix for the problem that hadn't been rolled out for other reasons
  • a weather event caused a power loss.


A major risk in having a “Where we got unlucky” category is that it’s used to label problems that aren’t actually due to blind misfortune. Consider this example from an internal postmortem:

Where we got unlucky

There were various production inconsistencies caused by past outages and experiments. These weren’t cleaned up properly and made it difficult to reason about the state of production.

This should instead be in “What went badly,” because there are clear action items that could remediate this for the future.

When you have these unlucky situations, you should always document them as part of "What went badly," while assessing the likelihood of them happening again and determining what actions you should take. You may choose not to mitigate every risk since you don’t have infinite engineering time, but you should always enumerate and quantify all the risks you can see so that “future you” can revisit your decision as circumstances change.

Summary


Hopefully we've provided a clear motivation for platform and service providers to share their internal postmortems outside the company, at some appropriate level of detail. In the next installment, we'll discuss how to get the greatest benefit out of these postmortems.

Building good SLOs – CRE life lessons



In a previous episode of CRE Life Lessons, we discussed how choosing good service level indicators (SLIs) and service level objectives (SLOs) is critical for defining and measuring the reliability of your service. There’s also a whole chapter in the SRE book about this topic. In this episode, we’re going to get meta and go into more detail about some best practices we use at Google to formulate good SLOs for our SLIs.

SLO musings


SLOs are objectives that your business aspires to meet and intends to take action to defend; just remember, your SLOs are not your SLAs (service level agreements)! You should pick SLOs that represent the most critical aspects of the user experience. If you meet an SLO, your users and your business should be happy. Conversely, if the system does not meet the SLO, that implies there are users who are being made unhappy!

Your business needs to be able to defend an endangered SLO by reducing the frequency of outages, or by reducing the impact of outages when they occur. Some ways to do this include slowing down the rate at which you release new versions, or implementing reliability improvements instead of features. All parts of your business need to acknowledge that these SLOs are valuable and should be defended through trade-offs.

Here are some important things to keep in mind when designing your SLOs:
  • An SLO can be a useful tool for resolving meaningful uncertainty about what a team should be doing. The objective is a line in the sand between "we definitely need to work on this issue" and "we might not need to work on this issue." Therefore, don’t pick SLO targets that are higher than what you actually need, even if you happen to be meeting them now, as that reduces your flexibility to change things in the future, including trade offs against reliability, like development velocity.
  • Group queries into SLOs by user experience, rather than by specific product elements or internal implementation details. For example, direct responses to user action should be grouped into a different SLO than background or ancillary responses (e.g., thumbnails). Similarly, “read” operations (e.g., view product) should be grouped into a different SLO than lower volume but more important “write” ones (e.g., check out). Each SLO will likely have different availability and latency targets.
  • Be explicit about the scope of your SLOs and what they cover (which queries, which data objects) and under what conditions they are offered. Be sure to consider questions like whether or not to count invalid user requests as errors, or what happens when a single client spams you with lots of requests.
  • Finally, though somewhat in tension with the above, keep your SLOs simple and specific. It’s better not to cover non-critical operations with an SLO than to dilute what you really care about. Gain experience with a small set of SLOs; launch and iterate!

Example SLOs


Availability

Here we're trying to answer the question "Was the service available to our user?" Our approach is to count the failures and known missed requests, and report the measurement as a percentage. Record errors from the first point that is in your control (e.g., data from your Load Balancer, not from the browser’s HTTP requests). For requests between microservices, record data from the client side, not the server side.

That leaves us with an SLO of the form:

Availability: <service> will <respond successfully> for <a customer scope> for at least <percentage> of requests in the <SLO period>

For example . . .

Availability: Node.js will respond with a non-503 within 30 seconds for browser pageviews for at least 99.95% of requests in the month.

. . . and . . .

Availability: Node.js will respond with a non-503 within 60 seconds for mobile API calls for at least 99.9% of requests in the month.

For requests that took longer than 30 seconds (60 seconds for mobile), the service might as well have been down, so they count against our availability SLO.
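To make this concrete, here's a minimal sketch of computing such an availability SLI from load-balancer request logs. The field names are hypothetical and the deadlines are taken from the example SLOs above; this is an illustration, not a Google library.

```python
from dataclasses import dataclass

@dataclass
class RequestLog:
    status_code: int    # HTTP status returned to the client
    latency_ms: float   # measured at the load balancer, not the browser
    scope: str          # "browser" or "mobile" in this example

# Deadlines from the example SLOs above; beyond these, the request counts as down.
DEADLINE_MS = {"browser": 30_000, "mobile": 60_000}

def availability_percent(requests, scope):
    """Percentage of in-scope requests that were neither 503s nor slower than the deadline."""
    in_scope = [r for r in requests if r.scope == scope]
    if not in_scope:
        return 100.0
    good = sum(1 for r in in_scope
               if r.status_code != 503 and r.latency_ms <= DEADLINE_MS[scope])
    return 100.0 * good / len(in_scope)

# The browser SLO is met for the month if availability_percent(logs, "browser") >= 99.95.
```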

Latency

Latency is a measure of how well a service performed for our users. We count the number of queries that are slower than a threshold, and report them as a percentage of total queries. The best measurements are done as close to the client as possible, so measure latency at the Load Balancer for incoming web requests, and from the client not the server for requests between microservices.

Latency: <service> will respond within <time limit> for at least <percentage> of requests in the <SLO period>.

For example . . .

Latency: Node.js will respond within 250ms for at least 50% of requests in the month, and within 3000ms for at least 99% of requests in the month.


Percentages are your friend . . .

Note that we expressed our latency SLI as a percentage: “percentage of requests with latency < 3000ms” with a target of 99%, not “99th percentile latency in ms” with a target of “< 3000ms”. This keeps SLOs consistent and easy to understand, because they all have the same unit and the same range. Also, accurately computing percentiles across large data sets is hard, while counting the number of requests below a threshold is easy. You’ll likely want to monitor multiple thresholds (e.g., percentage of requests < 50ms, < 250ms, . . .), but having SLO targets of 99% for one threshold, and possibly 50% for another, is generally sufficient.
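As a sketch of what the "count requests under a threshold" approach looks like (the function names and the threshold/target pairs mirror the example above and are purely illustrative):

```python
def percent_under(latencies_ms, threshold_ms):
    """SLI expressed as 'percentage of requests with latency < threshold'."""
    if not latencies_ms:
        return 100.0
    return 100.0 * sum(1 for l in latencies_ms if l < threshold_ms) / len(latencies_ms)

# Monitor a handful of thresholds, each with its own target percentage.
LATENCY_SLOS = {250: 50.0, 3000: 99.0}   # threshold_ms -> target %

def latency_slo_report(latencies_ms):
    return {f"< {t}ms": (percent_under(latencies_ms, t), target)
            for t, target in LATENCY_SLOS.items()}
```

Counts below a fixed threshold also aggregate cleanly across tasks and time windows, whereas percentiles computed per task can't simply be averaged together.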

Avoid targeting average (mean) latency: it's almost never what you want. Averages hide outliers, and below a certain point users can't tell the difference anyway; a full-page response time of 50 ms is not noticeably better to users than 250 ms, so both should be considered comparably good. There’s a big difference between an average of 250ms because all requests are taking 250ms, and an average of 250ms because 95% of requests are taking 1ms and 5% of requests are taking 5s.

. . . except 100%

A target of 100% is impossible over any meaningful length of time. It’s also likely not necessary. SREs use SLOs to embrace risk; the inverse of your SLO target is your error budget, and if your SLO target is 100% that means you have no error budget! In addition, SLOs are a tool for establishing team priorities, dividing top-priority work from work that's prioritized on a case-by-case basis. SLOs tend to lose their credibility if every individual failure is treated as a top priority.

Regardless of the SLO target that you eventually choose, the discussion is likely to be very interesting; be sure to capture the rationale for your chosen target for posterity.

Reporting

Report on your SLOs quarterly, and use quarterly aggregates to guide policies, particularly pager thresholds. Using shorter periods tends to shift focus to smaller, day-to-day issues, and away from the larger, infrequent issues that are more damaging. Any live reporting should use the same sliding window as the quarterly report, to avoid confusion; the published quarterly report is merely a snapshot of the live report.

Example quarterly SLO summary

This is how you might present the historical performance of your service against SLO, e.g., for a semi-annual service report, where the SLO period is one quarter:

SLO                   Target    Q2        Q3
Web Availability      99.95%    99.92%    99.96%
Mobile Availability   99.9%     99.91%    99.97%
Latency ≤ 250ms       50%       74%       70%
Latency ≤ 3000ms      99%       99.4%     98.9%

For SLO-dependent policies such as paging alerts or freezing of releases when you’ve spent the error budget, use a sliding window shorter than a quarter. For example, you might trigger a page if you spent ≥1% of the quarterly error budget over the last four hours, or you might freeze releases if you spent ≥ ⅓ of the quarterly budget in the last 30 days.
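As a rough sketch of how those policies might be encoded, assuming an availability SLO measured in failed requests (the 1% and ⅓ figures mirror the examples above; everything else is hypothetical):

```python
def quarterly_error_budget(slo_target, expected_requests_per_quarter):
    """Allowed failed requests per quarter for an availability SLO target like 0.9995."""
    return (1.0 - slo_target) * expected_requests_per_quarter

def fraction_of_budget_spent(failed_requests, budget):
    return failed_requests / budget

def should_page(failures_last_4h, budget):
    # Page if >= 1% of the quarterly error budget was spent in the last four hours.
    return fraction_of_budget_spent(failures_last_4h, budget) >= 0.01

def should_freeze_releases(failures_last_30d, budget):
    # Freeze releases if >= 1/3 of the quarterly budget was spent in the last 30 days.
    return fraction_of_budget_spent(failures_last_30d, budget) >= 1 / 3
```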

Breakdowns of SLI performance (by region, by zone, by customer, by specific RPC, etc.) are useful for debugging and possibly for alerting, but aren’t usually necessary in the SLO definition or quarterly summary.

Finally, be mindful about with whom you share your SLOs, especially early on. They can be a very useful tool for communicating expectations about your service, but the more broadly they are exposed the harder it is to change them.

Conclusion


SLOs are a deep topic, but we’re often asked about handy rules of thumb people can use to start reasoning about them. The SRE book has more on the topic, but if you start with these basic guidelines, you’ll be well on your way to avoiding the most common mistakes people make when starting with SLOs. Thanks for reading; we hope this post has been helpful. And as we say here at Google, may the queries flow, your SLOs be met and the pager stay silent!

CRE life lessons: The practicalities of dark launching



In the first part of this series, we introduced you to the concept of dark launches. In a dark launch, you take a copy of your incoming traffic and send it to the new service, then throw away the result. Dark launches are useful when you want to launch a new version of an existing service, but don’t want nasty surprises when you turn it on.

This isn’t always as straightforward as it sounds, however. In this blog post, we’ll look at some of the circumstances that can make things difficult for you, and teach you how to work around them.

Finding a traffic source

Do you actually have existing traffic for your service? If you’re launching a new web service which is not more-or-less-directly replacing an existing service, you may not.

As an example, say you’re an online catalog company that lets users browse items from your physical store’s inventory. The system is working well, but now you want to give users the ability to purchase one of those items. How would you do a dark launch of this feature? How can you approximate real usage when no user is even seeing the option to purchase an item?

One approach is to fire off a dark-launch query to your new component for every user query to the original component. In our example, we might send a background “purchase” request for an item whenever the user sends a “view” request for that item. Realistically, not every user who views an item will go on to purchase it, so we might randomize the dark launch by only sending a “purchase” request for one in every five views.
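A minimal sketch of that idea, with placeholder functions standing in for the real components (none of these names come from an actual system):

```python
import logging
import random

DARK_PURCHASE_FRACTION = 0.2   # roughly one dark "purchase" per five "view" requests

def render_item_page(item_id):
    # Placeholder for the existing, user-visible view path.
    return f"<html>item {item_id}</html>"

def dark_purchase(item_id, user_id):
    # Placeholder call to the new purchase component; its result is only logged.
    return {"status": "ok", "item": item_id, "user": user_id}

def handle_view(item_id, user_id):
    response = render_item_page(item_id)
    if random.random() < DARK_PURCHASE_FRACTION:
        try:
            logging.info("dark purchase result: %s", dark_purchase(item_id, user_id))
        except Exception:
            logging.exception("dark purchase failed for item %s", item_id)
    return response   # the dark result never reaches the user
```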

This will hopefully give you an approximation of live traffic in terms of volume and pattern. It won’t perfectly match the live traffic you’ll see once the feature launches, but it’s better than nothing.

Dark launching mutating services

Generally, a read-only service is fairly easy to dark-launch. A service with queries that mutate backend storage is far less easy. There are still strong reasons for doing the dark launch in this situation, because it gives you some degree of testing that you can’t reasonably get elsewhere, but you’ll need to invest significant effort to get the most from dark-launching.

Unless you’re doing a storage migration, you’ll need to make significant effort/payoff tradeoffs doing dark launches for mutating queries. The easiest option is to disable the mutates for the dark-launch traffic, returning a dummy response after the mutate is prepared but before it’s sent. This is safe, but it does mean that you’re not getting a full measurement of the dark launched service — what if it has a bug that causes 10% of the mutate requests to be incorrectly specified?

Alternatively, you might choose to send the mutation to a temporary duplicate of your existing storage. This is much better for the fidelity of your test, but great care will be needed to avoid sending real users the response from your temporary duplicate. It would also be very unfortunate for everyone if, at the end of your dark launch, you end up making the new service live when it’s still sending mutations to the temporary duplicate storage.
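Here's a sketch of the first (easiest) option, with placeholder storage and mutation-building code; the dark path does all the preparation work but skips the write, which is exactly the limitation described above:

```python
def build_checkout_mutation(order):
    # Placeholder: in reality this validates the order and builds the storage write.
    return {"table": "orders", "row": order}

class Storage:
    def apply(self, mutation):
        return {"status": "committed", "mutation": mutation}

storage = Storage()

def handle_checkout(order, dark_launch=False):
    mutation = build_checkout_mutation(order)   # exercises most of the new code path
    if dark_launch:
        # Dark traffic: skip the write and return a dummy response for logging/diffing.
        return {"status": "ok", "dark_launch": True, "would_write": mutation}
    return storage.apply(mutation)              # only real traffic actually commits
```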

Storage migration

If you’re doing a storage migration — moving an existing system’s stored data from one storage system to another (for instance, MySQL to MongoDB because you’ve decided that you don’t really need SQL after all) — you’ll find that dark launches will be crucial in this migration, but you’ll have to be particularly careful about how you handle mutation-inducing queries. Eventually you’ll need mutations to take effect in both your old and new storage systems, and then you’ll need to make the new storage system the canonical storage for all user queries.

A good principle is that, during this migration, you should always make sure that you can revert to the old storage system if something goes wrong with the new one. You should know which of your systems (old and new) is the master for a given set of queries, and hence holds the canonical state. Mastership generally needs to be easy to change, so that you can revert responsibility to the original storage system without losing data.

The universal requirement for a storage migration is a detailed written plan, reviewed not just by your system stakeholders but also by technical experts from the systems involved. Inevitably, your plan will miss things and will have to adapt as you move through the migration. Moving between storage systems can be an awfully big adventure — expect us to address this in a future blog post.

Duplicate traffic costs

The great thing about a well-implemented dark launch is that it exercises the full service in processing a query, for both the original and new service. The problem this brings is that each query costs twice as much to process. That means you should do the following:


  • Make sure your backends are appropriately provisioned for 2x the current traffic. If you have quota in other teams’ backends, make sure it’s temporarily increased to cover the dark launch as well.
  • If you’re connection-sensitive, ensure that your frontends have sufficient slack to accommodate a 2x connection count.
  • You should already be monitoring latency from your existing frontends, but keep a close eye on this monitoring stat and consider tightening your existing alerting thresholds. As service latency increases, service memory likely also increases, so you’ll want to be alert for either of these stats breaching established limits.


In some cases, the service traffic is so large that a 100% dark launch is not practical. In these instances, we suggest that you determine the largest percentage launch that is practical and plan accordingly, aiming to get the most representative selection of traffic in the dark launch. Within Google, we tend to launch a new service to Googlers first before making the service public. However, experience has taught us that Googlers are often not representative of the rest of the world in how they use a service.

An important consideration if your service makes substantial use of caching is that a sub-50% dark launch is unlikely to see material benefits from caching and hence will probably significantly overstate estimated load at 100%.

You may also choose to test-load your new service at over 100% of current traffic by duplicating some traffic — say, firing off two queries to the new service for every original query. This is fine, but you should scale your quota increases accordingly. If your service is cache-sensitive, then this approach will probably not be useful as your cache hit rate will be artificially high.

Because of the load impact of duplicate traffic, you should carefully consider how to use load shedding in this experiment. In particular, all dark launch traffic should be marked “sheddable” and hence be the first requests to be dropped by your system when under load.
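One way to express that on the dark-launched backend's side is sketched below; the "criticality" label and the CPU threshold are invented purely for illustration.

```python
CPU_SHED_THRESHOLD = 0.85   # made-up overload signal for this sketch

def handle(request):
    return {"status": 200}   # placeholder for the real request handler

def serve(request, current_cpu_utilization):
    """Entry point on the dark-launched backend."""
    if (request.get("criticality") == "sheddable"
            and current_cpu_utilization >= CPU_SHED_THRESHOLD):
        # Dark traffic is the first thing dropped when the backend is overloaded.
        return {"status": 503, "shed": True}
    return handle(request)
```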

In any case, if your service on-call sees an unexpected increase in CPU/memory/latency, they should drop the dark launch to 0% and see if that helps.

Summary

If you’re thinking about a dark launch for a new service, consider writing a dark launch plan. In that plan, make sure you answer the following questions:


  • Do you have existing traffic which you can fork and send to your new service?
  • Where will you fork the traffic: the application frontend, or somewhere else?
  • Will you fire off the message to the new backend asynchronously, or will you wait for it and impose a timeout?
  • What will you do with requests that generate mutations?
  • How and where will you log the responses from the original and new services, and how will you compare them?
    • Are you logging the following things: response code, backend latency, and response message size?
    • Will you be diffing responses? Are there fields that cannot meaningfully be diffed which you should skip in your comparison?
  • Have you made sure that your backends can handle 2x the current peak traffic, and have you given them temporary quota for it?
    • If not, at what percentage traffic will you stop the dark launch?
  • How are you going to select traffic for participation in the dark launch percentage: randomly, or by hashing on a key such as user ID? (See the sketch after this list.)
  • Which teams need to know that this dark launch is happening? Do they know how to escalate concerns?
  • What’s your rollback plan after you make your new service live?
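On the traffic-selection question above, hashing a stable key gives you consistent per-user behavior, which is often what you want when comparing responses. A minimal sketch (our own function, not a standard API):

```python
import hashlib

def in_dark_launch(user_id: str, percent: int) -> bool:
    """Deterministically include roughly `percent`% of users in the dark launch.

    Hashing a stable key (here, a user ID) keeps the same users consistently
    in or out of the launch, unlike per-request random sampling.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent

# e.g. in_dark_launch("user-12345", 10) selects about 10% of users, always the same ones.
```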


It may be that you don’t have enough surprises or excitement in your life; in that case, you don’t need to worry about dark launches. But if you feel that your service gives you enough adrenaline rushes already, dark launching is a great technique to make service launches really, really boring.

CRE life lessons: What is a dark launch, and what does it do for me?



Say you’re about to launch a new service. You want to make sure it’s ready for the traffic you expect, but you also don’t want to impact real users with any hiccups along the way. How can you find your problem areas before your service goes live? Consider a dark launch.

A dark launch sends a copy of real user-generated traffic to your new service, and discards the result from the new service before it's returned to the user. (Note: We’ve also seen dark launches referred to as “feature toggles,” but this doesn’t generally capture the “dark” or hidden traffic aspect of the launch.)

Dark launches allow you to do two things:

  1. Verify that your new service handles realistic user queries in the same way as the existing service, so you don’t introduce a regression.
  2. Measure how your service performs under realistic load.
Dark launches typically transition gradually from a small percentage of the original traffic to a full (100%) dark launch where all traffic is copied to the new backend, discovering and resolving correctness and scaling issues along the way. If you already have a source of traffic for your new site — for instance, when you’re migrating from an existing frontend to a new frontend — then you’re an excellent candidate for a dark launch.


Where to fork traffic: clients vs. servers

When considering a dark launch, one key question is where the traffic copying/forking should happen. Normally this is the application frontend, i.e. the first service, which (after load balancing) receives the HTTP request from your user and calculates the response. This is the ideal place to do the fork, since it has the lowest friction of change — specifically, in varying the percentage of external traffic sent to the new backend. Being able to quickly push a configuration change to your application frontend that drops the dark launch traffic fraction back down to 0% is an important — though not crucial — requirement of a dark launch process.
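A bare-bones sketch of such a fork at the application frontend, with the dark fraction standing in for a dynamically pushed configuration value and placeholder backend calls:

```python
import logging
import random

# In practice this fraction comes from pushed configuration, so it can be
# dropped back to 0.0 quickly without a binary release.
DARK_LAUNCH_FRACTION = 0.05

def call_original_backend(request):
    return {"status": 200, "body": "original"}   # placeholder

def call_new_backend(request):
    return {"status": 200, "body": "new"}        # placeholder

def handle_request(request):
    response = call_original_backend(request)    # what the user actually gets
    if random.random() < DARK_LAUNCH_FRACTION:
        try:
            dark_response = call_new_backend(request)
            logging.info("dark launch comparison: %s vs %s", response, dark_response)
        except Exception:
            logging.exception("dark launch request failed")
        # the dark response is deliberately discarded
    return response
```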

If you don’t want to alter the existing application frontend, you could replace it with a new proxy service which does the traffic forking to both your original and a new version of the application frontend and handles the response diffing. However, this increases the dark launch’s complexity, since you’ll have to juggle load balancing configurations to insert the proxy before the dark launch and remove it afterwards. Your proxy almost certainly needs to have its own monitoring and alerting — all your user traffic will be going through it, and it’s completely new code. What if it breaks?

One alternative is to send traffic at the client level to two different URLs, one for the original service, and the other for the new service. This may be the only practical solution if you’re dark launching an entirely new app frontend and it’s not practical to forward traffic from the existing app frontend — for instance, if you’re planning to move a website from being served by an open-source binary to your own custom application. However, this approach comes with its own set of challenges.



The main risk in client changes is the lack of control over the client’s behavior. If you need to turn down the traffic to the new application, then you’ll at least need to push a configuration update to every affected mobile application. Most mobile applications don’t have a built-in framework for dynamically propagating configuration changes, so in this case you’ll need to make a new release of your mobile app. It also potentially doubles the traffic from mobile apps, which may increase user data consumption.

Another client change risk is that the destination change gets noticed, especially for mobile apps whose teardowns are a regular source of external publicity. Response diffing and logging results is also substantially easier within an application frontend than within a client.


How to measure a dark launch

It’s little use running a dark launch if you’re not actually measuring its effect. Once you’ve got your traffic forked, how do you tell if your new service is actually working? How will you measure its performance under load?

The easiest way is to monitor the load on the new service as the fraction of dark launch traffic ramps up. In effect, it’s a very realistic load test, using live traffic rather than canned traffic. Once you’re at 100% dark launch and have run over a typical load cycle — generally, at least one day — you can be reasonably confident that your server won’t actually fall over when the launch goes live.

If you’re planning a publicity push for your service, you should try to maximize the additional load you put on your service and adjust your launch estimate based on a conservative multiplier. For example, say that you can generate 3 dark launch queries for every live user query without affecting end-user latency. That lets you test how your dark-launched service handles three times the peak traffic. Do note, however, that increasing traffic flow through the system by this amount carries operational risks. There is a danger that your “dark” launch suddenly generates a lot of “light” — specifically, a flickering yellow-orange light which comes from the fire currently burning down your service. If you’re not already talking to your SREs, you need to open a channel to them right now to tell them what you’re planning.

Different services have different peak times. A service that serves worldwide traffic and directly faces users will often peak Monday through Thursday during US morning hours, since US users typically dominate such traffic. By contrast, a service like a photo upload receiver is likely to peak on weekends when users take more photos, and will see huge spikes on major holidays like New Year’s. Your dark launch should try to cover the heaviest live traffic that it’s reasonable to wait for.

We believe that you should always measure service load during a dark launch as it is very representative data for your service and requires near-zero effort to do.

Load is not the only thing you should be looking at, however, as the following measurements should also be considered.


Logging needs

The point where incoming requests are forked to the original and new backends — generally, the application front end — is typically also the point where the responses come back. This is, therefore, a great place to record the responses for later analysis. The new backend results aren’t being returned to the user, so they’re not normally visible directly in monitoring at the application frontend. Instead, the application will want to log these responses internally.

Typically the application will want to log response code (e.g. 20x/40x/50x), latency of the query to the backend, and perhaps the response size, too. It should log this information for both the old and new backends so that the analysis can be a proper comparison. For instance, if the old backend is returning a 40x response for a given request, the new backend should be expected to return the same response, and the logs should enable developers to make this comparison easily and spot discrepancies.

We also strongly recommend that responses from original and new services are logged and compared throughout dark launches. This tells you whether your new service is behaving as you expect with real traffic. If your logging volume is very high, and you choose to use sampling to reduce the impact on performance and cost, make sure that you account in some way for the undetected errors in your traffic that were not included in the logs sample.
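For illustration, here's the kind of per-request comparison record the frontend might emit, with sampling applied; the field names and sample rate are ours, not a prescribed schema.

```python
import json
import random
import time

LOG_SAMPLE_RATE = 0.1   # log 1 in 10 comparisons; scale counts by 10x in analysis

def timed_call(backend, request):
    """Call a backend and return (response, latency in milliseconds)."""
    start = time.monotonic()
    response = backend(request)
    return response, (time.monotonic() - start) * 1000.0

def log_comparison(request_id, old, old_ms, new, new_ms):
    if random.random() >= LOG_SAMPLE_RATE:
        return
    record = {
        "request_id": request_id,
        "old": {"code": old["status"], "latency_ms": round(old_ms, 1),
                "bytes": len(json.dumps(old))},
        "new": {"code": new["status"], "latency_ms": round(new_ms, 1),
                "bytes": len(json.dumps(new))},
    }
    print(json.dumps(record))   # stand-in for a structured-logging call
```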

Timeouts as a protection

It’s quite possible that the new backend is slower than the original — for some or all traffic. (It may also be quicker, of course, but that’s less interesting.) This slowness can be problematic if the application or client is waiting for both original and new backends to return a response before returning to the client.

The usual approaches are either to make the new backend call asynchronous, or to enforce an appropriately short timeout for the new backend call after which the request is dropped and a timeout logged. The asynchronous approach is preferred, since the latter can negatively impact average and percentile latency for live traffic.

You must set an appropriate timeout for calls to your new service, and you should also make those calls asynchronous from the main user path, as this minimizes the effect of the dark launch on live traffic.
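A sketch of the asynchronous-with-timeout approach using Python's asyncio; the timeout value and backend stubs are invented for the example.

```python
import asyncio
import logging

NEW_BACKEND_TIMEOUT_S = 0.5   # made-up deadline for the dark request

async def call_original_backend(request):
    return {"status": 200}      # placeholder

async def call_new_backend(request):
    await asyncio.sleep(0.1)    # placeholder for real work
    return {"status": 200}

async def dark_call(request):
    try:
        response = await asyncio.wait_for(call_new_backend(request), NEW_BACKEND_TIMEOUT_S)
        logging.info("dark response: %s", response)
    except asyncio.TimeoutError:
        logging.warning("dark request timed out")

async def handle_request(request):
    # Fire the dark call as a background task; the user path never waits on it.
    asyncio.create_task(dark_call(request))
    return await call_original_backend(request)
```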


Diffing: What’s changed, and does it matter?

Dark launches where the responses from the old and new services can be explicitly diff’ed produce the most confidence in a new service. This is often not possible with mutations, because you can’t sensibly apply the same mutation twice in parallel; it’s a recipe for conflicts and confusion.

Diffing is nearly the perfect way to ensure that your new backend is drop-in compatible with the original. At Google, it’s generally done at the level of protocol buffer fields. There may be fields where it’s acceptable to tolerate differences, e.g. ordering changes in lists. There’s a trade-off between the additional development work required for a precise meaningful comparison and the reduced launch risk this comparison brings. Alternatively, if you expect a small number of responses to differ, you might give your new service a “diff error budget” within which it must fit before being ready to launch for real.

You should explicitly diff original and new results, particularly those with complex contents, as this can give you confidence that the new service is a drop-in replacement for the old one. In the case of complex responses, we strongly recommend either setting a diff “error budget” (accept up to 1% of responses differing, for instance) or excluding low-information, hard-to-diff fields from comparison.
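A simple sketch of diffing with excluded fields and a diff error budget; the field names and the 1% figure are illustrative, and real responses would typically be protocol buffers rather than dicts.

```python
IGNORED_FIELDS = {"server_timestamp", "debug_info"}   # low-information, hard-to-diff fields
DIFF_ERROR_BUDGET = 0.01                              # tolerate up to 1% differing responses

def differing_fields(old, new, ignored=IGNORED_FIELDS):
    """Return the set of fields whose values differ, skipping the ignored fields."""
    keys = (set(old) | set(new)) - ignored
    return {k for k in keys if old.get(k) != new.get(k)}

class DiffTracker:
    def __init__(self):
        self.total = 0
        self.differing = 0

    def record(self, old, new):
        self.total += 1
        if differing_fields(old, new):
            self.differing += 1

    def within_budget(self):
        return self.total == 0 or self.differing / self.total <= DIFF_ERROR_BUDGET
```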

This is all well and good, but what’s the best way to do this diffing? While you can do the diffing inline in your service, export some stats, and log diffs, this isn't always the best option. It may be better to offload diffing and reporting out of the service that issues the dark launch requests.

Within Google, we have a number of diffing services. Some run batch comparisons, some process data in live streams, others provide a UI for viewing diffs in live traffic. For your own service, work out what you need from your diffing and implement something appropriate.

Going live

In theory, once you’ve dark-launched 100% of your traffic to the new service, making it go “live” is almost trivial. At the point where the traffic is forked to the original and new service, you’ll return the new service response instead of the original service response. If you have an enforced timeout on the new service, you’ll change that to be a timeout on the old service. Job done! Now you can disable monitoring of your original service, turn it off, reclaim its compute resources, and delete it from your source code repository. (A team meal celebrating the turn-down is optional, but strongly recommended.) Every service running in production is a tax on support and reliability, and reducing the service count by turning off a service is at least as important as adding a new service.

Unfortunately, life is seldom that simple. (As id Software’s John Cash once noted, “I want to move to ‘theory,’ everything works there.”) At the very least, you’ll need to keep your old service running and receiving traffic for several weeks in case you run across a bug in the new service. If things start to break in your new service, your reflexive action should be to make the original service the definitive request handler because you know it works. Then you can debug the problem with your new service under less time pressure.

The process of switching services may also be more complex than we’ve suggested above. In our next blog post, we’ll dig into some of the plumbing issues that increase the transition complexity and risk.

Summary

Hopefully you’ll agree that dark launching is a valuable tool to have when launching a new service on existing traffic, and that managing it doesn’t have to be hard. In the second part of this series, we’ll look at some of the cases that make dark launching a little more difficult to arrange, and teach you how to work around them.

Making the most of an SRE service takeover – CRE life lessons



In Part 2 of this blog post we explained what an SRE team would want to learn about a service angling for SRE support, and what kind of improvements they want to see in the service before considering it for take-over. And in Part 1, we looked at why an SRE team would or wouldn’t choose to onboard a new application. Now, let’s look at what happens once the SREs agree to take on the pager.

Onboarding preparation

If a service entrance review determines that the service is suitable for SRE support, developers and the SRE team move into the “onboarding” phase, where they prepare for SREs to support the service.

While developers address the action items, the SRE team starts to familiarize itself with the service, building up service knowledge and familiarity with the existing monitoring tools, alerts and crisis procedures. This can be accomplished through several methods:
  • Education: present the new service to the rest of the team through tech talks, discussion sessions and "wheel of misfortune" scenarios.
  • “Take the pager for a spin”: share pager alerts with the developers for a week, and assess each page on the axes of criticality (does this indicate a user-impacting problem with the service?) and actionability (is there a clear path for the on-call to resolve the underlying issue?). This gives the SRE team a quantitative measure of how much operational load the service is likely to impose.
  • On-call shadow: page the primary on-call developer and SRE at the same time. At this stage, responsibility for dealing with emergencies rests with the developer, but the developer and the SRE collaborate on debugging and resolving production issues.

Measuring success


Q: I’ve gone through a lot of effort to make my service ready to hand over to SRE. How can I tell whether it was a good expenditure of scarce engineering time?

If the developer and SRE teams have agreed to hand over a system, they should also agree on criteria (including a timeframe) to measure whether the handover was successful. Such criteria may include (with appropriate numbers):
  • Absolute decrease of paging/outages count
  • Decreasing paging/outages as a proportion of (increasing) service scale and complexity.
  • Reduced time/toil from the point of new code passing tests to being deployed globally, and a flat (or decreasing) rollback rate.
  • Increased utilization of reserved resources (CPU, memory, disk etc.)
Setting these criteria can then prepare the ground for future handover proposals; if the success criteria for a previous handover were not met, the teams should carefully reconsider how this will change the handover plans for a new service.

Taking over the pager


Once all the blocking action items have been resolved, it’s time for SREs to take over the service pager. This should be a "no drama" event, with a small number of well-documented service alerts that can be easily resolved by following procedures in the service playbook.

In theory, the SRE team will have identified most of these issues in the entrance review phase, but realistically there are many issues that only become apparent with sustained exposure to a service.

In the medium term (one to two months), SREs should build a list of deficiencies or areas for optimization in the system with regard to monitoring, resource consumption etc. This hitlist should primarily aim to reduce SRE “toil” (manual, repetitive, tactical work that has no enduring value), and secondarily fix aspects of the system, e.g., resource consumption or cruft accumulation, which can impact system performance. Tertiary changes may include things like updating the documentation to facilitate onboarding new SREs for system support.

In the long term (three to six months), SREs should expect to meet most or all of the pre-established measurements for takeover success as described above.

Q: That’s great, so now my developers can turn off their pager?

Not so fast, my friend. Although the SRE team has learned a lot about the service in the preceding months, they're still not experts; there will inevitably be failure modes involving arcane service behavior where the SRE on-call will not know what has broken, or how to fix it. There's no substitute for having a developer available, and we normally require developers to keep their on-call rotation so that the SRE on-call can page them if needed. We expect this to be a low rate of pages.

The nuclear option — handing back the pager


Not all SRE takeovers go smoothly, and even if the SREs have taken over the pager for a service, it’s possible for reliability to regress or operational load to increase. This might be for good reasons, such as a “success disaster” (a sustained and unexpected spike in usage), or for bad reasons, such as poor QA testing.

An SRE team can only handle so many services, and if one service starts to consume a disproportionate amount of SRE time, it's at risk of crowding out other services. In this case, the SRE team should proactively tell the developer team that they have a problem, and should do so in a neutral way that’s data-heavy:

In the past month we’ve seen 100 pages/week for service S1, compared to a steady rate of 20-30 pages/week over the past few weeks. Even though S1 is within SLO, the pages are dominating our operational work and crowding out service improvement work. You need to do one of the following:
  1. bring S1’s paging rate down to the original rate by reducing S1’s rate of change
  2. de-tune S1’s alerts so that most of them no longer page
  3. tell us to drop SRE support for services S2, S3 so our overall paging rate remains steady
  4. tell us to drop SRE support for S1
This lets the developer team decide what’s most important to them, rather than the SRE team imposing a solution.

There are also times when developers and SREs agree that handing back the pager to developers is the right thing to do, even if the operational load is normal. For example, imagine SREs are supporting a service, and developers come up with a new, shiny, higher-performing version. Developers support the new version initially, while working out its kinks, and migrate more and more users to it. Eventually the new version is the most heavily used; this is when SREs should take on the pager for the new service and hand the old service’s pager back to developers. Developers can then finish user migrations and turn down the old service at their convenience.

Converging your SRE and dev teams


Onboarding a service is about more than transferring responsibility from developers to SREs; it also improves mutual understanding between the two teams. The dev team gets to know what the SRE team does, and why, who the individual SREs are, and perhaps how they got that way. Similarly, the SRE team gains a better understanding of the development team’s work and concerns. This increase in empathy is a Good Thing in itself, but is also an opportunity to improve future applications.

Now, when a developer team designs a new application or service, they should take the opportunity to invite the SRE team to the discussion. SRE teams can easily spot reliability issues in the design, and advise developers on ways to make the service easier to operate, set up good monitoring and configure sensible rollout policies from the start.

Similarly, when the SREs do future planning or design new tooling, they should include developers in the discussions; developers can advise them on future launches and projects, and give feedback on making the tools easier to operate or a better fit for developers’ needs.

Imagine that there was a brick wall between the SRE and developer teams; our original plan for service takeover was to throw the service over the wall and hope. Over the course of these blog posts, we’ve shown you how to make a hole in the wall so there can be two-way communication as the service is passed through, then expand it into a doorway so that SREs can come into the developers’ backyard and vice versa. Eventually, developers and SREs should tear down the wall entirely, and replace it with a low hedge and ornamental garden arch. SREs and developers should be able to see what’s going on in each others’ yard, and wander over to the other side as needed.


Summary


When an SRE team takes on pager responsibility for a developer-supported service, don’t just throw it over the fence into their yard. Work with the SRE team to help them understand how the service works and how it breaks, and to find ways to make it more resilient and easier to support. Make sure that supporting your service is a good use of the SRE team’s time, making use of their particular skills. With a carefully-planned handover process, you can both be confident that the queries will flow and your pagers will be (mostly) silent.