
Last month today: July on GCP

The month of July saw our Google Cloud Next ‘18 conference come and go, and there was plenty of exciting news, updates and demos to share from the show. Here’s a look at some of the most-read blog posts from July.

What caught your attention this month: Creating the open cloud
  • One of the most-read posts this month covered the launch of our Cloud Services Platform, which allows you to build a true hybrid cloud infrastructure. Key components of Cloud Services Platform include the managed Istio service mesh, Google Kubernetes Engine (GKE) On-Prem and GKE Policy Management, Cloud Build for fully managed CI/CD, and several serverless offerings (more on that below). Combined, these technologies can help you gain the consistency, security, speed and flexibility of the cloud in your local data center, along with the freedom of workload portability to the environment of your choice.
  • Another popular read was a rundown of Google Cloud’s new serverless offerings. These include core serverless compute announcements such as new App Engine runtimes and the general availability of Cloud Functions. The post also covered serverless containers, so you can run serverless workloads in a fully managed container environment; the GKE Serverless add-on, for easily running serverless workloads on Kubernetes Engine; and Knative, the open-source project on which that add-on is built. There are even more features covered in the post, like Cloud Build, Stackdriver monitoring and Cloud Firestore integration with GCP.
Bringing detailed metrics and Kubernetes apps to the forefront
  • Another must-read post this month for many of you was Transparent SLIs: See Google Cloud the way your application experiences it, announcing the availability of detailed data insights on GCP services that your workloads use—helping you see like a Google site reliability engineer (SRE). These new service-level indicators (SLIs) go way beyond basic uptime and downtime to delve into response codes, latency and more. You can then separate out metrics by GCP service to see things like API version, location and protocol. The result is that you can filter and sort to get extremely fine-grained information on your software and the GCP services you use, which helps cut resolution times and improve the support experience. Transparent SLIs are available now through the Stackdriver monitoring console. Learn more here about the basics of using SLIs and other SRE tools to measure and manage availability.
  • It’s also now faster and easier to find production-ready commercial Kubernetes apps in the GCP Marketplace. These apps are prepackaged and configured to get up and running easily, whether on Kubernetes Engine or other Kubernetes clusters, and run the gamut from security, data analytics and developer tools to storage, machine learning and monitoring.
There was obviously a lot to talk about at the show, and you can get even more detail on what happened at Next ‘18 here.

Building the cloud back-end
  • For all of you developing cloud apps with Java, the availability of Jib was an exciting announcement last month. This open-source container image builder, available as Gradle and Maven plugins, cuts out several steps from the Docker build flow. Jib does all the work required to package your app into a container image—you don’t need to write a Dockerfile or even have Docker installed. You end up with faster builds and reproducible container images.
  • And on that topic, this best practices for building containers post was a hit, too, giving you tips that will set you up to run your environment more smoothly. The tips in this blog post cover graceful application shutdowns, how to simplify containers and how to choose and tag the container images you’ll use. 
It’s been a busy month at GCP, and we’re glad to share lots of new tools with you. Till next time, build away!

SRE fundamentals: SLIs, SLAs and SLOs

Next week at Google Cloud Next ‘18, you’ll be hearing about new ways to think about and ensure the availability of your applications. A big part of that is establishing and monitoring service-level metrics—something that our Site Reliability Engineering (SRE) team does day in and day out here at Google. The end goal of our SRE principles is to improve services and, in turn, the user experience, and next week we’ll be discussing some new ways you can incorporate SRE principles into your operations.

In fact, a recent Forrester report on infrastructure transformation offers details on how you can apply these SRE principles at your company—more easily than you might think. They found that enterprises can apply most SRE principles either directly or with minor modification.

To learn more about applying SRE in your business, we invite you to join Ben Treynor, head of Google SRE, who will be sharing some exciting announcements and walking through real-life SRE scenarios at his Next ‘18 Spotlight session. Register now as seats are limited.

The concept of SRE starts with the idea that metrics should be closely tied to business objectives. We use several essential measurements—SLO, SLA and SLI—in SRE planning and practice.

Defining the terms of site reliability engineering

These measurements aren’t just useful abstractions. Without them, you cannot know if your system is reliable, available or even useful. If they don’t tie explicitly back to your business objectives, then you don’t have data on whether the choices you make are helping or hurting your business.

As a refresher, here’s a look at the key measurements of SRE, as discussed by AJ Ross, Adrian Hilton and Dave Rensin of our Customer Reliability Engineering team, in the January 2017 blog post, SLOs, SLIs, SLAs, oh my - CRE life lessons.


1. Service-Level Objective (SLO)

SRE begins with the idea that a prerequisite to success is availability. A system that is unavailable cannot perform its function and will fail by default. Availability, in SRE terms, defines whether a system is able to fulfill its intended function at a point in time. In addition to being used as a reporting tool, the historical availability measurement can also describe the probability that your system will perform as expected in the future.

When we set out to define the terms of SRE, we wanted to set a precise numerical target for system availability. We term this target the Service-Level Objective (SLO) of our system. Any discussion we have in the future about whether the system is running sufficiently reliably and what design or architectural changes we should make to it must be framed in terms of our system continuing to meet this SLO.

Keep in mind that the more reliable the service, the more it costs to operate. Define the lowest level of reliability that you can get away with for each service, and state that as your SLO. Every service should have an SLO—without it, your team and your stakeholders cannot make principled judgments about whether your service needs to be made more reliable (increasing cost and slowing development) or less reliable (allowing greater velocity of development). Excessive availability can become a problem because now it’s the expectation. Don’t make your system overly reliable if you don’t intend to commit to it being that reliable.

Within Google, we implement periodic downtime in some services to prevent a service from being overly available. You might also try experimenting with planned-downtime exercises with front-end servers occasionally, as we did with one of our internal systems. We found that these exercises can uncover services that are using those servers inappropriately. With that information, you can then move workloads to somewhere more suitable and keep servers at the right availability level.

2. Service-Level Agreement (SLA)

At Google, we distinguish between an SLO and a Service-Level Agreement (SLA). An SLA normally involves a promise to someone using your service that its availability should meet a certain level over a certain period, and if it fails to do so then some kind of penalty will be paid. This might be a partial refund of the service subscription fee paid by customers for that period, or additional subscription time added for free. The concept is that going out of SLA is going to hurt the service team, so they will push hard to stay within SLA. If you’re charging your customers money, you will probably need an SLA.

Because of this, and because of the principle that availability shouldn’t be much better than the SLO, the SLA is normally a looser objective than the SLO. This might be expressed in availability numbers: for instance, an availability SLA of 99.9% over one month, with an internal availability SLO of 99.95%. Alternatively, the SLA might only specify a subset of the metrics that make up the SLO.

If you have an SLA that is different from your SLO, as it almost always is, it’s important for your monitoring to measure SLA compliance explicitly. You want to be able to view your system’s availability over the SLA calendar period, and easily see if it appears to be in danger of going out of SLA. You will also need a precise measurement of compliance, usually from logs analysis. Since we have an extra set of obligations (in the form of our SLA) to paying customers, we need to measure queries received from them separately from other queries. That’s another benefit of establishing an SLA—it’s an unambiguous way to prioritize traffic.

When you define your SLA, you need to be extra-careful about which queries you count as legitimate. For example, if a customer goes over quota because they released a buggy version of their mobile client, you may consider excluding all “out of quota” response codes from your SLA accounting.

3. Service-Level Indicator (SLI)

We also have a direct measurement of SLO conformance: the frequency of successful probes of our system. This is a Service-Level Indicator (SLI). When we evaluate whether our system has been running within SLO for the past week, we look at the SLI to get the service availability percentage. If it goes below the specified SLO, we have a problem and may need to make the system more available in some way, such as running a second instance of the service in a different city and load-balancing between the two. If you want to know how reliable your service is, you must be able to measure the rates of successful and unsuccessful queries; these will form the basis of your SLIs.
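To make that measurement concrete, here is a minimal sketch of the calculation in Python. The request counts and the 99.9% target are purely illustrative, not from any real service:

```python
# A minimal sketch: compute an availability SLI from success/failure counts
# and compare it against an SLO target. All numbers here are made up.

def availability_sli(successful: int, failed: int) -> float:
    """Fraction of well-formed requests that were served successfully."""
    total = successful + failed
    if total == 0:
        return 1.0  # no traffic in this window; treat as available (or exclude it)
    return successful / total

SLO_TARGET = 0.999  # 99.9% availability

successful, failed = 1_482_310, 2_120
sli = availability_sli(successful, failed)
print(f"SLI: {sli:.4%}  within SLO: {sli >= SLO_TARGET}")
```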

Since the original post was published, we’ve made some updates to Stackdriver that let you incorporate SLIs even more easily into your Google Cloud Platform (GCP) workflows. You can now combine your in-house SLIs with the SLIs of the GCP services that you use, all in the same Stackdriver monitoring dashboard. At Next ‘18, the Spotlight session with Ben Treynor and Snapchat will illustrate how Snap uses its dashboard to get insight into what matters to its customers and map it directly to what information it gets from GCP, for an in-depth view of customer experience.
Automatic dashboards in Stackdriver for GCP services let you group the 50th, 95th and 99th percentile charts several ways: per service, per method and per response code. You can also view latency charts on a log scale to quickly find outliers.

If you’re building a system from scratch, make sure that SLIs and SLOs are part of your system requirements. If you already have a production system but don’t have them clearly defined, then that’s your highest priority work. If you’re coming to Next ‘18, we look forward to seeing you there.

See related content:


Understanding error budget overspend – part one – CRE life lessons

In previous CRE Life Lessons blog posts, the Google Customer Reliability Engineering (CRE) team has spent a lot of time talking about service level objectives (SLOs), which measure whether your service is meeting its reliability targets from the point of view of its end users. Your SLO lets you specify how much downtime your service can have in a given period—for example, 43 minutes every 30 days for a service that needs to be available 99.9% of the time. This downtime allowance is your error budget. Like a household budget, it’s OK to spend this error budget over those 30 days, as long as you don’t spend more than that.
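As a quick back-of-the-envelope check of that 43-minute figure, here is the arithmetic in Python; the availability targets are just examples:

```python
# Illustrative arithmetic: convert an availability target into a downtime
# allowance (the error budget) for a 30-day window.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime allowed per window for a given availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60

for slo in (0.999, 0.9995, 0.9999):
    print(f"{slo:.2%} -> {error_budget_minutes(slo):.1f} minutes per 30 days")
# 99.90% -> 43.2 minutes per 30 days  (the "43 minutes" mentioned above)
# 99.95% -> 21.6 minutes per 30 days
# 99.99% -> 4.3 minutes per 30 days
```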

If you do run out of your error budget, either by spending a bit too much each day, or by having a major outage that blows it all at once, that tells you that your service’s users are suffering too much and it’s time to give them a break. How do you do that? Here are a few questions to consider to see if you need to recalibrate your error budget.

Where are you spending your error budget?

Your SLOs will be target values for corresponding service level indicators (SLIs), which are the measurements of the critical parts of the end-user experience. One SLI for the 99.9% available example system above might be “the percent of HTTP responses which are successful (200), out of all 2xx and 5xx HTTP responses.” You calculate your error budget spend as the percent of the measurement period where your service fails to reach all of its SLO targets; depending on the granularity and accuracy of your SLI measurement, this might be done on a per-minute, per-hour or even per-day basis.
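One way to picture that calculation is the sketch below, which treats each minute of the window as either meeting or missing the SLO target. The per-minute values and thresholds are hypothetical:

```python
# Hypothetical sketch: error budget spend as the fraction of minutes in the
# measurement period where the per-minute SLI missed the SLO target.

SLO_TARGET = 0.999            # 99.9% of responses successful
ALLOWED_BAD_FRACTION = 0.001  # budget: at most 0.1% of minutes may miss the target

# One availability measurement per minute of the window, e.g.
# successful / (successful + failed) HTTP responses in that minute.
# A real 30-day window would have 43,200 entries.
per_minute_sli = [0.9997, 1.0, 0.9985, 1.0, 0.941, 1.0]

bad_minutes = sum(1 for sli in per_minute_sli if sli < SLO_TARGET)
spend = bad_minutes / len(per_minute_sli)
status = "over" if spend > ALLOWED_BAD_FRACTION else "within"
print(f"budget spent: {spend:.3%} of the window ({status} budget)")
```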

When you analyze your error budget spend day-by-day, you should try to attribute the main causes of error budget spend over the measurement period:
  • Do most of your errors happen when you’re doing binary releases? That implies that you’re not going to be able to keep within budget unless you do something to make releases less frequent, less error-prone or lower-impact when there is an error.
  • Are you seeing steady error spend coming from intermittent application failure, which adds up to the majority of your budget? That’s telling you that you’ve got a fundamental failure in your application. It’s a strong signal that you need to drill down in your logs to find the troublesome queries, and that you should expect to dedicate some of your engineers to identify the root causes and either address them directly or plan to fix them in your next project planning cycle.
  • Are large chunks of your error budget getting spent by major application failures, where most of your service goes down for many minutes due to configuration pushes, excessive load or queries-of-death? You need to run effective postmortems to identify the root causes and mitigate them. You will need to redirect some of your development engineering effort to address the top action items from those postmortems—so feature development and releases will naturally happen more slowly. (More on this in another post.)
  • Is the bulk of your spend coming from a dependency outside your control, such as a critical backend or your compute platform? You’ll need to address the dependency or platform owner directly, showing them your SLI metrics and negotiating about how they can make their service more reliable—or how you can be more resilient to the expected failure modes.
For each of these cases, you have an objective measurement of whether the problem has been sufficiently addressed: you will expect your SLIs to stay high in circumstances where previously they plummeted.

Are you measuring the right signal?

Something else you should consider: Did the outage reflect real user pain? If you have a strong indicator that users weren’t concerned by a major outage that spent a chunk of your error budget, then you may not have to change your development practices or architecture, but you still have something to fix. Either you should determine a new, lower target level for your SLO, or you should find a different SLI that better represents the user experience.

Can your users tolerate a slightly worse experience?

Suppose you’re trying to run your service at a 99.9% availability level, with the corresponding 43-minute-per-month error budget, but you’re consistently failing to meet that; you’re spending 50-60 minutes per month. How much does that actually matter?

You probably have business intelligence channels for measuring customer happiness in terms of time spent on your site, purchase rate, support tickets raised and other fairly direct measurements of user happiness. Evaluate those statistics against your SLIs: Are your budget overspend periods correlated with less user happiness, and if so, what’s the correlation function? If a 50% error budget overspend corresponds to a 1% decrease in customer revenue, then you may feel that you can adjust your SLO target and aim for a 99.5% availability level, rather than spend a lot of engineering effort trying to raise your availability to the original target.
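If you have those business metrics and your monthly budget spend side by side, the check can be as simple as the sketch below; the data here is entirely made up:

```python
# Made-up monthly data: is error-budget overspend correlated with a business
# signal such as month-over-month revenue change? (statistics.correlation
# needs Python 3.10+; numpy.corrcoef works just as well.)
import statistics

budget_spend   = [0.8, 1.5, 1.1, 2.0, 0.9, 1.6]     # fraction of monthly budget spent
revenue_change = [0.4, -0.9, 0.1, -1.2, 0.3, -0.8]  # revenue change, percent

r = statistics.correlation(budget_spend, revenue_change)
print(f"Pearson correlation: {r:.2f}")  # strongly negative here: overspend hurts revenue
```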

What is important in this case is to have, and document, the data used to determine the SLO target. You don’t want to fall into the trap of increasing your error budget by 50% each period because “users don’t really care”—you need to articulate the tradeoff in user happiness/spend vs. reliability in your SLO definition. An SLO specification shouldn’t just contain numbers and metric names. It should also reference the logic and data used to determine the SLO target.

When your users’ experience isn’t definitive

It may be true that the customer is always right—but what if your service’s users are part of your company? In some cases, the overall business decision may be that continuing to build and release the software is in the best interest of the company as a whole, even if you’re consistently going over budget. The error budget spend may cause an inconvenience to employees, but failing to release new versions of the software would have a significant cost to the company that outweighs user inconvenience.

This can occur when there's a disconnect between what the users of the software are perceived to need (for example, the 99.9% availability target of this example service) and what the executives who pay for the development of the software think these users should tolerate in the name of greater velocity.

Now that we understand what an error budget is telling us, in part two of this post we will look at how best to keep a positive balance.

Interested to learn more about site reliability engineering (SRE) in practice? We’ll be discussing how to apply SRE principles to deploy and manage services at Next ‘18 in July. Join us!

Related content:

Good housekeeping for error budgets – part two – CRE life lessons

In part one of this CRE Life Lessons post, we talked about the error budget for a service and what it tells you about how close your service is to breaching its reliability targets, the service-level objectives (SLOs). Once you’ve done some digging to understand why you may be consistently overspending your error budget, it’s time to fix the root causes of the problem.

Paying off your error budget debt

Those of us who have held a significant balance on a credit card are familiar with the large bite it can take out of a monthly household budget. Good housekeeping practice means that we should be looking to pay down as much of that debt as possible in order to shrink the monthly charge. Error budgets work the same way.

Once your error budget spend analysis identifies the major contributors to the spending rate, you should be prepared to redirect your developers’ efforts from new features to addressing those causes of spend. This might be an improved QA environment or test set to catch more release errors before they hit production, or better rollout automation and monitoring to detect and roll back bad releases more quickly.

The effect of this approach is likely to be that you make less frequent releases, or each release has fewer changes and hence is less likely to contain an error-budget-impacting problem. You’re slowing down release velocity temporarily in order to allow safer releasing at the original velocity in future.

Looking at downstream spend

Another issue to consider is: What if the error budget overspend wasn’t the developers’ fault? If your data center or cloud platform has a hardware outage, there’s not much the developers can do about it. Sure, your end users don’t care why the service broke, and you don’t want to make their lives worse, but it seems harsh to ding your developers for someone else’s failure. This should surface in your analysis of error budget spend, as described above.

What next? You may need to talk to the owners of that platform about their historical (measured) reliability and how it squares with you trying to run your service at your target SLO. It may be that changes are needed on both sides: You change your system to be able to detect and tolerate certain failures from them, and they improve detection and resolution time of the failure cases that impact you.

Often, a platform is not going to change significantly, so you have to decide how to account for that error spend in future. You may decide that it’s significant enough that you need to increase your resilience to it, e.g., by implementing (and exercising!) the option to fail your service automatically out of an affected cloud region over to an unaffected region. (See our “Defining SLOs for services with dependencies” blog post, which dealt with this problem in depth.)

When your releases are the problem

It could be, however, that your analysis leads you to the conclusion that software releases are a major source of your error budget spend. Certainly, our experience at Google is that binary rollouts are one of the top sources of outages; many a postmortem starts “We rolled out a new release of the software, which we thought did <X>, which our users would love, but in fact it did <Y>, which caused users to see errors in their browser/be unable to view photos of their cat/receive 100 email bounces a minute.”

The canonical response to a series of bad releases that overspend the error budget is to impose a freeze on the release of new features. This can be seen as a last-resort action; it acknowledges the existing efforts to pay down debt have not delivered sufficient reliability improvement, so lowering the rate of change is instead required to protect user experience. A freeze of this nature can also provide the space and direction to development teams to allow them to refocus their attention away from features onto reliability improvements. However, it’s a drastic step to take.

Other ways you can avoid freezing include:
  • Make an explicitly agreed-upon adjustment to the feature vs. reliability work balance. For example, your company normally does two-week sprints, where 95% of the work is feature-driven, and 5% is postmortem action items and other reliability work. You agree that while your service is out of SLO, the sprints will instead be 50/50.
  • Overprovision your service. For instance, pay more money to replicate to another cloud zone and/or region, or run more replicas of your service to handle higher traffic loads. This is only effective if you have determined that this approach will help mitigate error budget spend.
  • Declare a reliability incident. Appoint a specific person to analyze the postmortem and error budget spend and come up with recommendations. It’s important that the business has pre-committed to prioritizing those recommendations.

Winter is coming

If you really have to impose a new features freeze, how long should it last? Generally, it should last until you have removed the overspend in your error budget, and have confidence it will not recur. We’ve seen two principal methods of error budget calculation: fixed intervals (say, each calendar month) and rolling intervals (the last N days).

If you operate a fixed interval for your error budget calculation, your reaction to an error budget overspend depends on when it happens. If it happens on day 1, you spend the whole month frozen; if it’s on day 28, you may not actually need to stop releasing because your next release may be in the next month, when the error budget is reset. Unless your customer is also sensitive to outages on a calendar month basis, this doesn’t seem to be a good practice to optimize your customers’ experience.

For a rolling 30-day error budget measurement period, your 99.9% available service regains, each day, the error budget it spent on day N-30. So if your budget is 20 minutes overspent, you need to wait until those 20 minutes of debt have dropped out of the measurement window. If you spent 15 minutes of your budget on day N-29 and five minutes on day N-28, you’d need to wait two more days to get back to a positive balance, assuming no further outages. In practice, you’d probably wait until you accumulate a buffer of 20% of your error budget so you are resilient to minor unexpected spends.
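Here is a small sketch of that rolling-window bookkeeping for a 99.9% service; the daily spend figures are invented for illustration:

```python
# Sketch: a rolling 30-day error budget for a 99.9% SLO, showing how spend
# "drops off" the window one day at a time. Daily figures are hypothetical.

WINDOW_DAYS = 30
BUDGET_MINUTES = 0.001 * WINDOW_DAYS * 24 * 60  # 43.2 minutes per rolling window

def remaining_budget(daily_spend_minutes: list[float]) -> float:
    """Budget left over the most recent WINDOW_DAYS days (negative = overspent)."""
    return BUDGET_MINUTES - sum(daily_spend_minutes[-WINDOW_DAYS:])

# About a minute of background spend per day, plus a 35-minute outage 30 days ago.
daily_spend = [1.0] * 60
daily_spend[-30] += 35.0

print(f"today:    {remaining_budget(daily_spend):+.1f} min")  # overspent
daily_spend.append(1.0)  # a new day passes; the outage leaves the window
print(f"tomorrow: {remaining_budget(daily_spend):+.1f} min")  # back in the black
```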

Following this guidance, if you have a major outage that spends your entire month’s budget in one day, then you’d be frozen for an entire month. In practice, this may not be acceptable. At the very least, though, you should be drastically down-scaling your release velocity in order to have more engineering time to devote to fixing the root causes (see “Paying off your error budget debt” above). There are other approaches, too: Check out the discussion about blocking releases in an earlier episode of CRE Life Lessons, where we analyzed an example escalation policy.

As you can see, the rolling period for error budget measurement is less prone to a varying reaction depending on the particular date of an outage. We recommend that you adopt this approach if you can, though admittedly it can be challenging to accumulate this kind of data in monitoring tools currently.

The long-term costs of freezes

Freezing the release of new features isn’t free of cost. In a worst-case scenario, if your developers are continuing new feature development but not releasing those features to users, the changes will build up, and when you finally resume releases it is almost inevitable that you’re going to see a series of broken releases. We’ve seen this happen in practice: if we impose a freeze on a service over an event like Black Friday or New Year’s, we expect that the week following the freeze will be unusually busy with service failures as all the backed-up changes reach users. To avoid this, it’s important to re-emphasize to teams affected by the freeze that it is intended to provide space to focus on reliability development, not feature development.

Sometimes it’s not possible to freeze all releases. Your company may have a major event coming up, such as a conference, and so there’s a compelling need to push certain new features into production no matter what the recent experience of its users. One process you could adopt in this case is the concept of a silver bullet: The product management team has a (very limited) right to override a release freeze to deploy a critical feature. To make this approach work well, that right needs to be expensive to exercise and limited in frequency: The spend of a silver bullet should be regarded as a failure, and require a postmortem to analyze how it came about and how to mitigate the risk of it happening again.

Using the error budget to your (and your users’) advantage

An error budget is a crucial concept when you’re taking a principled approach to service reliability. Like a household budget, it’s there for you (the service owner) to spend, and it's important for the service stakeholders to agree on what should happen when you overspend it ahead of doing so. If you find you’ve overspent, a feature freeze can be an effective tool to prioritize development time toward reliability improvements. But remember that reflexively freezing your releases when you blow through your error budget isn’t always the appropriate response. Consider where your budget is being spent, how to reduce the major sources of spend and whether some loosening of the purse strings is in order. The most important principle: Do it based on data!

Interested to learn more about site reliability engineering (SRE) in practice? We’ll be discussing how to apply SRE principles to deploy and manage services at Next ‘18 in July. Join us!

Related content:

Applying the Escalation Policy — CRE life lessons

In past posts, we’ve discussed the importance of creating an explicit policy document describing how to escalate SLO violations, and given a real-world example of a document from an SRE team at Google. This final post is an exercise in hypotheticals, to provide some scenarios that exercise the policy and illustrate edge cases. The following scenarios all assume a "three nines" availability SLO for a service that burns half its error budget on background errors, i.e., our error budget is 0.1% errors, and serving 0.05% errors is "normal."

First, let's recap the policy thresholds:

  • Threshold 1: Automated alerts notify SRE of an at-risk SLO 
  • Threshold 2: SREs conclude they need help to defend SLO and escalate to devs 
  • Threshold 3: The 30-day error budget is exhausted and the root cause has not been found; SRE blocks releases and asks for more support from the dev team 
  • Threshold 4: The 90-day error budget is exhausted and the root cause has not been found; SRE escalates to executive leadership to commandeer more engineering time for reliability work

With that refresher in mind, let’s dig in to these SLO violations.

Scenario 1: A short but severe outage is quickly root-caused to a dependency problem

Scenario: A bad push of a critical dependency causes a 50% outage for an hour as the team responsible for the dependency rolls back the bad release. The error rate returns to previous levels when the rollback completes, and the team identifies the commit that is the root cause and reverts it. The team responsible for the dependency writes a postmortem, to which SREs for the service contribute some action items (AIs) to prevent recurrence.

Error budget: Assuming three nines, the service burned 70% of a 30-day error budget during this outage alone. It has exceeded the 7-day budget, and, given background errors, the 30-day budget as well.
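As a rough check of those numbers (assuming evenly distributed traffic), the arithmetic looks like this:

```python
# Worked arithmetic for the scenario above: three-nines SLO, 30-day window,
# 0.05% background error rate, 50% errors for one hour.

BUDGET = 1 - 0.999                 # 0.1% of requests may fail
WINDOW_HOURS = 30 * 24             # 720 hours

outage_spend = 0.50 * (1 / WINDOW_HOURS)            # ~0.069% of all requests
background   = 0.0005 * (WINDOW_HOURS - 1) / WINDOW_HOURS

print(f"outage alone:    {outage_spend / BUDGET:.0%} of the 30-day budget")   # ~69%
print(f"plus background: {(outage_spend + background) / BUDGET:.0%}")         # ~119%
```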

Escalation: SRE was alerted to deal with the impact to production (Threshold 1). If SRE judges the class of issue (bad config or binary push) to be sufficiently rare or adequately mitigated, then the service is "brought back into SLO", and the policy requires no other escalation. This is by design—the policy errs on the side of development velocity, and blocking releases solely because the 30-day error budget is exhausted goes against that.

If this is a recurring issue (e.g., occurring two weeks in a row, burning most of a quarter's error budget) or is otherwise judged likely to recur, then it’s time to escalate to Threshold 2 or 3.

Scenario 2: A short but severe outage has no obvious root cause


Scenario: The service has a 50% outage for an hour, cause unknown.

Error budget: Same as previous scenario: the service has exceeded both its 7-day and 30-day error budgets.

Escalation: SRE is alerted to deal with the impact to production (Threshold 1). SRE escalates quickly to the dev team. They may request new metrics to provide additional visibility of the problem, install fallback alerting, and the SRE and dev oncall prioritize investigating the issue for the next week (Threshold 2).

If the root cause continues to evade understanding, SRE pauses feature releases after the first week until the outage passes out of the 30-day SLO measurement window. More of the SRE and dev teams are pulled from their project work to debug or try to reproduce the outage—this is their number-one priority until the service is back within SLO or they find the root cause. As the investigation continues, SRE and dev teams shift towards mitigation, work-arounds and building defense-in-depth. Ideally, by the time the outage passes over the SLO horizon, SRE is confident that any recurrence will be easier to root-cause and will not have the same impact. In this situation, they can justify resuming release pushes even without having identified a concrete root cause.

Scenario 3: A slow burn of error budget with a badly attributed root cause


Scenario: A regression makes it into prod at time T, and the service begins serving 0.15% errors. The root cause of the regression eludes both SRE and developers for weeks; they attempt multiple fixes but don’t reduce the impact.

Error budget: If left unresolved, this burns 1.5 months' error budget per month.

Escalation: The SRE oncall is notified via ticket at around T+5 days, when the 7-day budget is burned. SRE triggers Threshold 2 and escalates to the dev team at about T+7 days. Because of some correlations with increased CPU utilization, the SRE and dev oncall hypothesize that the root cause of the errors is a performance regression. The dev oncall finds some simple optimizations and submits them; these get into production at T+16 days as part of the normal release process. The fixes don't resolve the problem, but now it is impractical to roll back two weeks of daily code releases for the affected service.

At T+20 days, the service exceeds its 30-day error budget, triggering Threshold 3. SRE stops releases and escalates for more dev engagement. The dev team agrees to assemble a working party of two devs and one SRE, whose priority is to root-cause and remedy the problem. With no good correlation between code changes and the regression, they start doing in-depth performance profiling and removing more bottlenecks. All the fixes are aggregated into a weekly patch on top of the current production release. The efficiency of the service increases noticeably, but it still serves an elevated rate of errors.

At T+60 days, the service expends its 90-day error budget, triggering Threshold 4. SRE escalates to executive leadership to ask for a significant quantity of development effort to address the ongoing lack of reliability. The dev team does a deep dive and finds some architectural problems, eventually reimplementing some features to close out an edge-case interaction between server and client. Error rates revert to their previous state and blocked features roll out in batches once releases begin again.

Scenario 4: Recurring, transient SLI excursions


Scenario: Binary releases cause transient error spikes that occur with the daily or weekly release.

Error budget: If the errors don’t threaten the long-term SLO, it's entirely SRE's responsibility to ensure the alert is properly tuned. So, for the sake of this scenario, assume that the error spikes don’t threaten the SLO over any window longer than a few hours.

Escalation: SLO-based alerting initially notifies SREs, and the escalation path proceeds as in the slow-burn case. In particular, the issue should not be considered "resolved" simply because the SLI returns to normal between releases, since the bar for bringing a service back into SLO is set higher than that. SRE may tune the alert so that a longer or faster burn of error budget triggers a page, and may decide to investigate the issue as part of their ongoing work on the service’s reliability.

Summary


It's important to “war-game” any escalation policy with hypothetical scenarios to make sure there are no unexpected edge cases and to check that the wording is clear. Circulate draft policies widely and address concerns that are raised in appendices, but don't clutter the policy itself with justifications of the chosen inflection points. Expect to have an extended discussion if your peers find your proposals contentious. Remember that the example shared here is just that: a real-life SRE team looking to meet a high availability target would likely structure their escalation policy quite differently.

An example escalation policy — CRE life lessons

In an earlier blog post, we discussed the spectrum of engineering effort between reliability and feature development and the importance of describing when and how an organization should dedicate engineering time towards the reliability of a service that is out of SLO. In this post, we show a lightly-edited SLO escalation policy and associated rationales from a Google SRE team to illustrate the trade-offs that particular teams make to maintain a high development velocity.

This SRE team works with large teams of developers focused on different areas of the serving stack, which comprises around ten high-traffic services and a dozen or so smaller ones, all with SRE support. The team has shards in Europe and America, each covering 12 hours of a follow-the-sun on-call rotation. The supported services have both coarse top-level SLOs representing desired user experience and finer-grained SLOs representing the availability requirements of stack components; crucially the SRE team can route pages to dev teams at the granularity of an individual SLO, making "revoking support" for an SLO both cheap and quick. Alerting is configured to page when the service has burned nine hours of error budget within an hour, and file a ticket when it has burned one week of error budget over the previous week.
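In burn-rate terms, those two alerting conditions look roughly like the sketch below. This is not the team's actual configuration; it assumes a request-based 99.9% SLO and made-up traffic numbers:

```python
# Sketch of the alerting conditions described above, assuming a request-based
# 99.9% SLO: page on a fast burn (9 hours of budget within 1 hour), file a
# ticket on a slow burn (1 week of budget over the previous week).

BUDGET_FRACTION = 0.001  # 99.9% SLO: 0.1% of requests may fail

def budget_hours_burned(errors: int, total: int, window_hours: float) -> float:
    """How many hours' worth of error budget the given window consumed."""
    if total == 0:
        return 0.0
    return (errors / total) / BUDGET_FRACTION * window_hours

# Last hour: 1,200 errors out of 90,000 requests (a 1.3% error rate).
page = budget_hours_burned(1_200, 90_000, window_hours=1) >= 9
# Last week: totals aggregated over 7 days (168 hours).
ticket = budget_hours_burned(60_000, 50_000_000, window_hours=168) >= 168

print(f"page: {page}, ticket: {ticket}")  # page: True, ticket: True
```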

It's important to note that this policy is just an example, and probably a poor one if your SRE team supports a service with availability targets of 99.99% or higher. The industry that this Google team operates in is highly competitive and moves quickly, making feature iteration speed and time-to-market more important than maintaining high levels of availability.

Escalation policy preamble


Before getting into the specifics of the escalation policy, it's important to consider the following broad points.

The intent of an escalation policy is not to be completely prescriptive; SREs are expected to make judgement calls as to appropriate responses to situations they face. Instead, this document establishes reasonable thresholds for specific actions to take place, with the intent of reducing the likely range of responses and achieving a measure of consistency. It's structured as a series of thresholds that, when crossed, trigger the redirection of more engineering effort towards addressing an SLO violation.

Furthermore, SRE must focus on fixing the class of issue before declaring an incident resolved. This is a higher bar than fixing the issue itself. For example, if a bad flag flip causes a severe outage, reverting the flag flip is insufficient to bring the service back into SLO. SRE must instead ensure that flag flips in general are extremely unlikely to threaten the SLO in the future, with staged rollouts, automated rollbacks on push failures, and versioned configuration to tie flags to binary versions.

For the following four thresholds in the escalation policy, "bringing a service back into SLO" means:
  • finding the root cause and fixing the relevant class of issue, or
  • automating remediation such that ongoing manual intervention is no longer necessary, or 
  • simply waiting one week, if the class of issue is extremely unlikely to recur with frequency and severity sufficient to threaten the SLO in the future
In other words, a plan for manual remediation is not sufficient to consider the service back within SLO. Bear in mind that you usually need to understand the root cause of a violation to conclude that it's unlikely to recur or to automate remediation.


Escalation policy thresholds


Threshold 1 - wherein SRE is notified that an SLO is potentially impacted

SRE will maintain alerting so as to be notified of danger to supported SLOs. Upon being notified, SRE will investigate and attempt to find and address the root cause. SRE will consider taking mitigating actions, including redirecting traffic at the load balancers and rolling back binary or configuration pushes. SRE on-call engineers will notify the dev team about the SLO impact and keep them updated as necessary, but no action on their part is required at this point.

Threshold 2 - wherein SRE escalates to the developers
  • If,
    • SRE have concluded they cannot bring the service into SLO without help, and
    • SRE and dev agree that the SLO represents desired user experience
  • Then,
    • SRE and dev on-calls prioritize fixing the root cause and update the bug daily
    • SRE escalates to dev leads for visibility and additional assistance if necessary
    • Alerting thresholds may be relaxed to avoid continually paging for the known issue, while continuing to provide protection against further regressions
  • When the service is brought back into SLO,
    • SRE will revert any alerting changes
    • SRE may create a postmortem
    • Or, if the SLO does not accurately represent desired user experience, the SRE, dev and product teams will agree to change or retire the SLO
Threshold 3 - wherein SRE pauses feature releases and focuses on reliability
  • If,
    • Conditions for the previous threshold are met for at least one week, and
    • The service has not been brought back into SLO, and
    • The 30-day error budget is exhausted
  • Then during the following week,
    • Only cherry-picked fixes for diagnosed root causes may be pushed to production
    • SRE may escalate to their leadership and dev management to request that members of the dev team prioritize finding and fixing the root cause over any non-emergency work
    • Daily updates may be made to an "escalations" mailing list (used to broadcast information about outages to a wide audience, including executive leadership).
  • When the service is brought back into SLO,
    • Normal binary releases resume
    • SRE creates a postmortem
    • Team members may re-prioritize normal project work

Threshold 4 - wherein SRE may escalate or revoke support

  • If,
    • Conditions for the previous threshold are met for at least one week, and
    • The service has not been brought back into SLO, and
    • The 90-day error budget is exhausted or the dev team is unwilling to pause feature work to improve reliability
  • Then,
    • SRE may escalate to executive leadership to commandeer more people dedicated to fixing the problem
    • SRE may revoke support for the SLO or the service, and re-direct or disable relevant alerting

On escalation and incident response


SREs are first responders, and there's an expectation that they'll make a reasonable effort to bring the service back within SLO before escalating to developers. As such, threshold 1 applies when the SRE team is notified about a violation, despite the one-week ticket alert indicating the seven-day budget is already exhausted. SRE should wait no longer than one week from the initial violation notification before escalating to developers, but they may exercise their own judgement as to whether escalation is appropriate before this point.

Every time SRE escalates, it’s important to ask developers whether the availability goals still represent the desired balance between reliability and development velocity. This gives them the choice between preserving the availability goals by rolling back a new feature, or temporarily relaxing those goals to keep that feature available to users, if the latter is the desired user experience. For repeated violations of the same SLO in a short time window, you probably don't need to ask the question over and over again, though that's a strong signal that further escalation is necessary. It's also OK to insist that developers take back the pager for the service until they're willing to restore the previously-agreed availability targets—if they want to run a less reliable service temporarily so that a business-critical feature remains available while they work on its reliability, they can also shoulder the burden of its failures.

On blocking releases


Blocking releases is an appropriate course of action for three main reasons:
  1. Commonly, the largest source of burnt error budget at steady state is the release push. If you’ve already burned all your budget, not pushing new releases lowers the steady-state burn rate, bringing the service back into SLO more quickly
  2. It eliminates the risk of further unexpected SLO violations due to bugs in new code. This is also why any fixes for diagnosed root causes must be patched into the current release, rather than rolling forward to a new release
  3. While blocking releases is not intended as a punitive measure, it does directly impact release velocity, which the dev org cares about deeply. As such, tying SLO violations to reduced velocity aligns the incentives of both organizations. SRE wants the service to stay within SLO, the dev org wants to build new features quickly. This way, either both happen or neither do.
SRE should prefer to unblock feature releases sooner rather than later, once the root cause(s) of a violation has been found and fixed. Giving our dev teams the benefit of the doubt that there will be no further service degradation before the SLO is in compliance over a 30-day window strikes a more acceptable balance between reliability and velocity. This is effectively "borrowing" future error budget to unblock the release before the service is compliant, with the expectation that it will be within a reasonable timeframe. Absent any push-related outages, new features should increase user happiness with the service, repaying some of the unhappiness caused by the SLO violation.

SRE may choose not to unblock releases if pre-violation error-budget burn rates were close to the SLO threshold. In this case, there's less future budget to borrow, thus the risk of further violations is higher and the time until the service is SLO compliant will be significantly longer if releases are allowed to continue.

Summary


We hope that the above example gives you some ideas about how to make trade-offs between reliability and development velocity for a service where the latter is a key business priority. The main concessions to velocity are that SRE doesn’t immediately block releases when an SLO is violated, and provides a mechanism for them to resume before the SLO has returned to compliance with the informed consent of SRE. In the final post of the series, we'll take these policy thresholds out for a spin with some hypothetical scenarios.

Consequences of SLO violations — CRE life lessons

Previous episodes of CRE life lessons have talked in detail about the importance of quantifying a service's availability and using SLOs to manage the competing priorities of features-focused development teams ("devs") versus a reliability-focused SRE team. Good SLOs can help reduce organizational friction and maintain development velocity without sacrificing reliability. But what should happen when SLOs are violated?

In this blog post, we discuss why you should create a policy on how SREs and devs respond to SLO violations, and provide some ideas for the structure and components of that policy. Future posts will go over an example taken from an SRE team here at Google, and work through some scenarios that put that policy into action.

Features or reliability?


In the ideal world (assuming spherical SREs in a vacuum), an SLO represents the dividing line between two binary states: developing new features when there's error budget to spare, and improving service reliability when there isn't. Most real engineering organizations will instead vary their effort on a spectrum between these two extremes as business priorities dictate. Even when a service is operating well within its SLOs, choosing to do some proactive reliability work may reduce the risk of future outages, improve efficiency and provide cost savings; conversely it's rare to find an organization that completely drops all in-flight feature development as soon as an SLO is violated.

Describing key inflection points from that spectrum in a policy document is an important part of the relationship between an SRE team and the dev teams with whom they partner. This ensures that all parts of the organization have roughly the same understanding around what is expected of them when responding to (soon to be) violated SLOs, and – most importantly – that the consequences of not responding are clearly communicated to all parties. The exact choice of inflection points and consequences will be specific to the organization and its business priorities.

Inflection points


Having a strong culture of blameless postmortems and fixing root causes should eventually mean that most SLO violations are unique – informally, “we are in the business of novel outages.” It follows that the response to each violation will also be unique; making judgement calls around these is part of an SRE's job when responding to the violation. But a large variance in the range of possible responses results in inconsistency of outcomes, people trying to game the system and uncertainty for the engineering organization.

For the purposes of an escalation policy, we recommend that SLO violations be grouped into a few buckets of increasing severity based on the cumulative impact of the violation over time (i.e., how much error budget has been burned over what time horizon), with clearly defined boundaries for moving from one bucket to another. It's useful to have some business justification for why violations are grouped as they are, but this should be in an appendix to the main policy to keep the policy itself clear.

It's a good idea to tie at least some of the bucket boundaries to any SLO-based alerting you have. For example, you may choose to page SREs to investigate when 10% of the weekly error budget has been burned in the past hour; this is an example of an inflection point tied to a consequence. It forms the boundary between buckets we might informally title "not enough error budget burned to notify anyone immediately" and "someone needs to investigate this right now before the service is out of its long-term SLO." We'll examine more concrete examples in our next post, where we look at a policy from an SRE team within Google.
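To see why that particular inflection point justifies a page, it helps to translate it into a burn rate. The 99.9% SLO in the sketch below is an assumption for illustration, not part of the policy:

```python
# Illustrative only: "10% of the weekly error budget burned in the past hour"
# expressed as a burn-rate multiple and, for an assumed 99.9% SLO, an error rate.

HOURS_PER_WEEK = 7 * 24            # 168
budget_share, window_hours = 0.10, 1.0

burn_rate = budget_share * HOURS_PER_WEEK / window_hours
print(f"burn rate: {burn_rate:.1f}x the sustainable rate")   # 16.8x

slo_budget = 0.001                 # assumed 99.9% availability SLO
print(f"paging error rate: {burn_rate * slo_budget:.2%}")    # ~1.68% errors in the hour
```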

Consequences


The consequences of a violation are the meat of the policy. They describe actions that will be taken to bring the service back into SLO, whether this is by root causing and fixing the relevant class of issue, automating any stop-gap mitigation tasks or by reducing the near-term risk of further deterioration. Again, the choice of consequence for a given threshold is going to be specific to the organization defining the policy, but there are several broad areas into which these fall. This list is not exhaustive!

Notify someone of potential or actual SLO violation

The most common consequence of any potential or actual SLO violation is that your monitoring system tells a human that they need to investigate and take remedial action. For a mature, SRE-supported service, this will normally be in the form of a page to the oncall when a large quantity of error budget has been burned over a short window, or a ticket when there’s an elevated burn rate over a longer time horizon. It's not a bad idea for that page to also create a ticket in which you can record debugging details, use as a centralized communication point and reference when escalating a serious violation.

The relevant dev team should also be notified. It's OK for this to be a manual process; the SRE team can add value by filtering and aggregating violations and providing meaningful context. But ideally a small group of senior people in the dev team should be made aware of actual violations in an automated fashion (e.g., by CCing them on any tickets), so that they're not surprised by escalations and can chime in if they have pertinent information.

Escalate the violation to the relevant dev team

The key difference between notification and escalation is the expectation of action on the part of the dev team. Many serious SLO violations require close cooperation between SREs and developers to find the root cause and prevent recurrence. Escalation is not an admission of defeat. SREs should escalate as soon as they're reasonably sure that input from the dev team will meaningfully reduce the time to resolution. The policy should set an upper bound on the length of time an SLO violation (or near miss) can persist without escalation.

Escalation does not signify the end of SRE’s involvement with an SLO violation. The policy should describe the responsibilities of each team and a lower bound on the amount of engineering time they should divert towards investigating the violation and fixing the root cause. It will probably be useful to describe multiple levels of escalation, up to and including getting executive-level support to commandeer the engineering time of the entire dev team until the service is reliable.

Mitigate risk of service changes causing further impact to SLOs

Since a service in violation of its SLO is by definition making users unhappy, day-to-day operations that may increase the rate at which error budget is burned should be slowed or stopped completely. Usually, this means restricting the rate of binary releases and experiments, or stopping them completely until the service is again within SLO. This is where the policy needs to ensure all parties (SRE, development, QA/testing, product and execs) are on the same page. For some engineering organizations, the idea that SLO violations will impact their development and release velocity may be difficult to accept. Reaching a documented agreement on how and when releases will be blocked – and what fraction of engineers will be dedicated to reliability work when this occurs – is a key goal.

Revoke support for the service

If a service is shown to be incapable of meeting its agreed-upon SLOs over an extended time period, and the dev team responsible for that service is unwilling to commit to engineering improvements to its reliability, then SRE teams at Google have the option of handing back the responsibility for running that service in production. This is unlikely to be the consequence of a single SLO violation, rather the combination of multiple serious outages over an extended period of time, where postmortem AIs have been assigned to the dev team but not prioritized or completed.

This has worked well at Google, because it changes the incentives behind any conversation around engineering for reliability. Any dev team that neglects the reliability of a service knows that they will bear the consequences of that neglect. By definition, revoking SRE support for a service is a last resort, but stating the conditions that must be met for it to happen makes it a matter of policy, not an idle threat. Why should SRE care about service reliability if the dev team doesn't?

Summary


Hopefully this post has helped you think about the trade-off between engineering for reliability and features, and how responding to SLO violations moves the needle towards reliability. In our next post, we'll present an escalation policy from one of Google's SRE teams, to show the choices they made to help the dev teams they partner with maintain a high development velocity.

What a year! Google Cloud Platform in 2017

The end of the year is a time for reflection ... and making lists. As 2017 comes to a close, we thought we’d review some of the most memorable Google Cloud Platform (GCP) product announcements, white papers and how-tos, as judged by popularity with our readership.

As we pulled the data for this post, some definite themes emerged about your interests when it comes to GCP:
  1. You love to hear about advanced infrastructure: CPUs, GPUs, TPUs, better network plumbing and more regions.
  2. How we harden our infrastructure is endlessly interesting to you, as are tips about how to use our security services.
  3. Open source is always a crowd-pleaser, particularly if it presents a cloud-native solution to an age-old problem.
  4. You’re inspired by Google innovation — unique technologies that we developed to address internal, Google-scale problems.

So, without further ado, we present to you the most-read stories of 2017.

Cutting-edge infrastructure

If you subscribe to the “bigger is always better” theory of cloud infrastructure, then you were a happy camper this year. Early in 2017, we announced that GCP would be the first cloud provider to offer the Intel Skylake architecture; GPUs for Compute Engine and Cloud Machine Learning became generally available; and Shazam talked about why cloud GPUs made sense for them. In the spring, you devoured a piece on the performance of TPUs, and another about the then-largest cloud-based compute cluster. We announced yet more new GPU models, and, topping it all off, Compute Engine began offering machine types with a whopping 96 vCPUs and 624GB of memory.

It wasn’t just our chip offerings that grabbed your attention — you were pretty jazzed about Google Cloud network infrastructure too. You read deep dives about Espresso, our peering-edge architecture, TCP BBR congestion control and improved Compute Engine latency with Andromeda 2.1. You also dug stories about new networking features: Dedicated Interconnect, Network Service Tiers and GCP’s unique take on sneakernet: Transfer Appliance.

What’s the use of great infrastructure without somewhere to put it? 2017 was also a year of major geographic expansion. We started out the year with six regions and ended it with 13, adding Northern Virginia, Singapore, Sydney, London, Germany, São Paulo and Mumbai. This was also the year that we shed our Earthly shackles, and expanded to Mars ;)

Security above all


Google has historically gone to great lengths to secure our infrastructure, and this was the year we discussed some of those advanced techniques in our popular Security in plaintext series. Among them: 7 ways we harden our KVM hypervisor, Fuzzing PCI Express and Titan in depth.

You also grooved on new GCP security services: Cloud Key Management and managed SSL certificates for App Engine applications. Finally, you took heart in a white paper on how to implement BeyondCorp as a more secure alternative to VPN, and in our support for the European GDPR data protection laws across GCP.

Open, hybrid development


When you think about GCP and open source, Kubernetes springs to mind. We open-sourced the container management platform back in 2014, but this year we showed that GCP is an optimal place to run it. It’s consistently among the first cloud services to run the latest version (most recently, Kubernetes 1.8) and comes with advanced management features out of the box. And as of this fall, it’s certified as a conformant Kubernetes distribution, complete with a new name: Google Kubernetes Engine.

Part of Kubernetes’ draw is as a platform-agnostic stepping stone to the cloud. Accordingly, many of you flocked to stories about Kubernetes and containers in hybrid scenarios. Think Pivotal Container Service and Kubernetes’ role in our new partnership with Cisco. The developers among you were smitten with Cloud Container Builder, a stand-alone tool for building container images, regardless of where you deploy them.

But our open source efforts aren’t limited to Kubernetes — we also made significant contributions to Spinnaker 1.0, and helped launch the Istio and Grafeas projects. You ate up our "Partnering on open source" series, featuring the likes of HashiCorp, Chef, Ansible and Puppet. Availability-minded developers loved our Customer Reliability Engineering (CRE) team’s missive on release canaries, and with API design: Choosing between names and identifiers in URLs, our Apigee team showed them a nifty way to have their proverbial cake and eat it too.

Google innovation


In distributed database circles, Google’s Spanner is legendary, so many of you were delighted when we announced Cloud Spanner and discussed how it defies the CAP theorem. Having a scalable database that offers strong consistency and great performance seemed to really change your conception of what’s possible — as did Cloud IoT Core, our platform for connecting and managing “things” at scale. CREs, meanwhile, showed you the Google way to handle an incident.

2017 was also the year machine learning became accessible. For those of you with large datasets, we showed you how to use Cloud Dataprep, Dataflow, and BigQuery to clean up and organize unstructured data. It turns out you don’t need a PhD to learn to use TensorFlow, and for visual learners, we explained how to visualize a variety of neural net architectures with TensorFlow Playground. One Google Developer Advocate even taught his middle-school son TensorFlow and basic linear algebra, as applied to a game of rock-paper-scissors.

Natural language processing also became a mainstay of machine learning-based applications; here, we highlighted it with a lighthearted and relatable example. We launched the Video Intelligence API and showed how Cloud Machine Learning Engine simplifies the process of training a custom object detector. And the makers among you really went for a post that shows you how to add machine learning to your IoT projects with Google AIY Voice Kit. Talk about accessible!

Lastly, we want to thank all our customers, partners and readers for your continued loyalty and support this year, and wish you a peaceful, joyful holiday season. And be sure to rest up and visit us again Next year. Because if you thought we had a lot to say in 2017, well, hold onto your hats.

Getting the most out of shared postmortems — CRE life lessons



In our previous post we discussed the benefits of sharing internal postmortems outside your company. You may adopt a one-to-many approach with an incident summary that tells all your customers what happened and how you'll prevent it from happening again. Or, if the incident impacted a major customer, you may share something close to your original postmortem with them.

In this post, we consider how to review a postmortem with your affected customer(s), both to gather better actionable data and to help customers improve their own systems and practices. We also present a worked example of a shared postmortem based on the SRE Book postmortem template.


Postmortems should fix your customer too

How to get outages to benefit everyone

Even if the fault was 100% on you, the platform side, an external postmortem can still help customers improve their reliability. Now that we know what happens when a particular failure occurs, how can we generalize this to help the customer mitigate the impact, and reduce mean time to detection (MTTD) and mean time to recovery (MTTR) for a similar incident in the future?

One of the best sources of data for any postmortem is your customers’ SLOs, with their ability to measure the impact of a platform outage. Our CRE team talks about SLOs quite a lot in the CRE Life Lessons series, and there’s a reason why: SLOs and error budgets inform more than just whether to release features in your software.
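As a concrete illustration (our own minimal sketch, not part of any particular postmortem), an error budget calculation turns "how bad was the outage?" into a number both sides can agree on. The SLO target, window and request counts below are hypothetical.

```python
# Minimal sketch: quantify how much of a customer's error budget
# a platform outage consumed. All numbers are hypothetical.

SLO_TARGET = 0.999    # 99.9% of requests succeed over the window
WINDOW_DAYS = 28      # rolling SLO window

total_requests = 100_000_000   # requests served in the window
failed_requests = 180_000      # failures attributed to the outage

error_budget = (1 - SLO_TARGET) * total_requests   # allowed failures in the window
budget_consumed = failed_requests / error_budget

print(f"Error budget for the window: {error_budget:,.0f} failed requests")
print(f"This outage consumed {budget_consumed:.0%} of the budget")
```

A figure like "this outage consumed 180% of the error budget" (the output of the hypothetical numbers above) is a far more useful starting point for a review than "latency was elevated for a while."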

For customers with defined SLOs who suffered a significant error budget impact, we recommend conducting a postmortem review with them. The review is partly to ensure that the customer’s concerns were addressed, but also to work through “what went wrong” and “where we got lucky,” and to identify actions that would address these for the customer.

For example, suppose the platform’s storage service suffered increased latency for a certain class of objects in one region. This is not the customer’s fault, but they may still be able to do something about it.

The internal postmortem might read something like:

What went well

  • The shared monitoring implemented with CustomerName showed a clear single-region latency hit which resulted in a quick escalation to storage oncall. 
What went wrong 

  • A new release of the storage frontend suffered from a performance regression for uncached reads that was not detected during testing or rollout. 
Where we got lucky 

  • Only reads of objects between 4KB and 32KB in size were materially affected. 
Action items

  • Add explicit read/write latency testing for both cached and uncached objects in buckets of 1KB, 4KB, 32KB, … (a sketch of such a test follows this list). 
  • Have paging alerts for latency over SLO limits, aggregated by Cloud region, for both cached and uncached objects, in buckets of 1KB, 4KB, 32KB, ... 
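To make the first action item above concrete, here is a minimal sketch of a per-size-bucket latency regression test. The storage_client object, its read_object/write_object/evict_cache methods and the latency budgets are hypothetical stand-ins for whatever client library and SLO limits the storage team actually uses.

```python
# Minimal sketch of a per-size-bucket read latency regression test.
# `storage_client` and its methods are hypothetical stand-ins.

import time

SIZE_BUCKETS = [1 * 1024, 4 * 1024, 32 * 1024]        # 1KB, 4KB, 32KB, ...
LATENCY_BUDGET_MS = {"cached": 20, "uncached": 120}   # assumed per-mode limits

def measure_read_ms(storage_client, key, cached):
    if not cached:
        storage_client.evict_cache(key)   # hypothetical: force an uncached read
    start = time.monotonic()
    storage_client.read_object(key)
    return (time.monotonic() - start) * 1000

def test_read_latency(storage_client):
    failures = []
    for size in SIZE_BUCKETS:
        key = f"latency-test-{size}"
        storage_client.write_object(key, b"x" * size)
        for mode in ("cached", "uncached"):
            latency = measure_read_ms(storage_client, key, cached=(mode == "cached"))
            if latency > LATENCY_BUDGET_MS[mode]:
                failures.append((size, mode, latency))
    assert not failures, f"Latency regressions: {failures}"
```

A check along these lines in the release pipeline is the kind of test that could have caught the uncached-read regression before rollout.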
When a customer writes their own postmortem about this incident, using the shared postmortem to better understand what broke in the platform and when, it might look like this:

What went well

  • We had anticipated a generic single-region platform failure and had the capability to fail over out of an affected region. 
What went wrong 

  • Although the latency increase was detected quickly, we didn’t have accessible thru-stack monitoring that could show us that it was coming from platform storage-service rather than our own backends. 
  • Our decision to fail out of the affected region took nearly 30 minutes to complete because we had not practiced it for one year and our playbook instructions were out of date. 
Where we got lucky 

  • This happened during business hours so our development team was on hand to help diagnose the cause. 
Action items 

  • Add explicit dashboard monitoring for aggregate read and write latency to and from platform storage-service (see the instrumentation sketch after this list). 
  • Run periodic (at least once per quarter) test failovers out of a region to validate that the failover instructions still work and increase ops team confidence with the process. 
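For the first of those action items, the customer might start with client-side instrumentation along these lines, recording how much of each request's latency is spent in the platform's storage service versus their own backends, broken down by region. The metric store, storage client and backend objects here are hypothetical stand-ins, not a specific monitoring product.

```python
# Minimal sketch: separate time spent in platform storage calls from time
# spent in our own backends, so a latency increase can be attributed quickly.
# The metric store and the client/backend objects are hypothetical.

import time
from collections import defaultdict
from contextlib import contextmanager

latency_ms = defaultdict(list)   # stand-in for a real metrics/histogram library

@contextmanager
def timed(component, region, operation):
    start = time.monotonic()
    try:
        yield
    finally:
        latency_ms[(component, region, operation)].append(
            (time.monotonic() - start) * 1000)

def handle_request(storage_client, backend, region, key):
    # Time spent in the platform's storage service...
    with timed("platform_storage", region, "read"):
        blob = storage_client.read_object(key)   # hypothetical client call
    # ...versus time spent in our own processing.
    with timed("own_backend", region, "process"):
        return backend.process(blob)             # hypothetical backend call
```

With that split on a dashboard, a repeat of this incident could be attributed to platform storage in minutes, without needing the dev team on hand.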

Prioritize and track your action items 


A postmortem isn’t complete until the root causes have been fixed 

Sharing the current status of your postmortem action items is tricky. It's unlikely that the customer will be using the same issue tracking system as you are, so neither side will have a “live” view of which action items from a postmortem have been resolved, and which are still open. Within Google we have automation which tracks this and “reminds” us of unclosed critical actions from postmortems, but customers can’t see those unless we surface them in the externally-visible part of our issue tracking system, which is not our normal practice.

Currently, we hold a monthly SLO review with each customer, where we list the major incidents and the postmortem or incident report for each; we use that occasion to report on the status of open critical bugs from previous months’ incidents, and to check how the customer is doing on their own action items.
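One lightweight way to prepare that part of the review is to keep a shared, structured list of postmortem action items and filter it for the meeting. The sketch below assumes a simple in-memory representation with hypothetical sample data; in practice each side would export this from its own issue tracker.

```python
# Minimal sketch: summarize open postmortem action items for a monthly
# SLO review. The ActionItem structure and sample data are hypothetical.

from dataclasses import dataclass

@dataclass
class ActionItem:
    postmortem: str    # which incident the item came from
    owner: str         # "provider" or "customer"
    priority: str      # e.g. "critical", "high", "normal"
    description: str
    closed: bool

def open_items_for_review(items, priority="critical"):
    """Return open action items of the given priority, grouped by owner."""
    report = {}
    for item in items:
        if item.closed or item.priority != priority:
            continue
        report.setdefault(item.owner, []).append(item)
    return report

# Example usage with hypothetical data:
items = [
    ActionItem("storage-latency-incident", "provider", "critical",
               "Per-bucket latency paging alerts", closed=False),
    ActionItem("storage-latency-incident", "customer", "critical",
               "Quarterly regional failover drill", closed=True),
]
for owner, open_items in open_items_for_review(items).items():
    print(owner, [i.description for i in open_items])
```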

Other benefits 

Opening up is an opportunity 

There are practical reliability benefits of sharing postmortems, but there are other benefits too. Customers who are evolving towards an SRE culture and adopting blameless postmortems can use the external postmortem as a model for their own internal write-ups. We’re the first to admit that it’s really hard to write your own first postmortem from scratch—having a collection of “known-good” postmortems as a reference can be very helpful.

At a higher level, shared postmortems give your customer a “glimpse behind the curtain.” When a customer moves from on-premises hardware to the cloud, it can be frightening; they're giving up a lot of control of and visibility into the platform on which their service runs. The cloud is expected to encapsulate the operational details of the services it offers, but unfortunately it can be guilty of hiding information that the customer really wants to see. A detailed external postmortem makes that information visible, giving the customer a timeline and deeper detail, which hopefully they can relate to.

Joint postmortems

If you want joint operations, you need joint postmortems 

The final step in the path to shared postmortems is creating a joint postmortem. Until this point, we’ve discussed how to externalize an existing document, where the action items, for example, are written by you and assigned to you. With some customers, however, it makes sense to do a joint postmortem where you both contribute to all sections of the document. It not only reflects your thoughts from the event, but also captures the customer’s thoughts and reactions. It will even include action items that you assign to your customer, and vice versa!

Of course, you can’t do joint postmortems with large numbers of your customers, but doing so with at least a few of them helps you (a) build shared SRE culture, and (b) keep the customer perspective in your debugging, design and planning work.

Joint postmortems are also one of the most effective tools you have to persuade your product teams to re-prioritize items on their roadmap, because they present a clear end-user story of how those items can prevent or mitigate future outages.


Summary 


Sharing your postmortems with your customers is not an easy thing to do; however, we have found that it helps you:

  • Gain a better understanding of the impact and consequences of your outages
  • Increase the reliability of your customers’ service
  • Give customers confidence in continuing to run on your platform even after an outage.

To get you started, here's an example of an external postmortem for the aforementioned storage frontend outage, using the SRE Book postmortem template. (Note: Text relating to the customer (“JaneCorp”) is marked in purple for clarity.) We hope it sets you on the path to learning and growing from your outages. Happy shared postmortem writing!