
Fearless shared postmortems — CRE life lessons



We here on Google’s Site Reliability Engineering (SRE) teams have found that writing a blameless postmortem — a recap and analysis of a service outage — makes systems more reliable, and helps service owners learn from the event.

Postmortems are easy to do within your company — but what about sharing them outside your organization? Why would you do this in the first place? It turns out that if you're a service or platform provider, sharing postmortems with your customers can be good for both you and them.

In this installment of CRE Life Lessons, we discuss the benefits and complications that external postmortems can bring, and some practical lessons about how to craft them.

Well-known external postmortems 

There is prior art, and you should read it. 

Over the years, we’ve had our share of outages, and recently we’ve been sharing more detail about them than we used to. For example, on April 11, 2016, Google Compute Engine dropped inbound traffic, resulting in this public incident report.

Other companies are also publishing detailed postmortems about their own outages, several of which have become well known in their own right.

We in Google SRE love reading these postmortems — and not because of schadenfreude. Many of us read them, think “there but for the grace of (Deity) go we,” and wonder whether we would withstand a similar failure. When you find yourself thinking this, it’s a good time to run a DiRT (Disaster Recovery Testing) exercise.

For platform providers that offer a wide range of services to a wide range of users, fully public postmortems such as these make sense (even though they're a lot of work to prepare and open you up to criticism from competitors and press). But even if the impact of your outage isn’t as broad, if you are practicing SRE, it can still make sense to share postmortems with customers that have been directly impacted. Caring about your customers’ reliability means sharing the details of your outages.

This is the position we take on the Google Cloud Platform (GCP) Customer Reliability Engineering (CRE) team. To help customers run reliably on GCP, we teach them how to engineer increased reliability for their service by implementing SRE best practices in our work together. We identify and quantify architectural and operational risks to each customer’s service, and work with them to mitigate those risks and sustain system reliability at their Service Level Objective (SLO) targets.

Specifically, the CRE team works with each customer to help them meet the availability target expressed by their SLOs. For this, the principal steps are to:

  1. Define a comprehensive set of business-relevant SLOs
  2. Get the customer to measure compliance with those SLOs in their monitoring platform (how much of the service error budget has been consumed)
  3. Share that live SLO information with Google support and product SRE teams (which we term shared monitoring)
  4. Jointly monitor and react to SLO breaches with the customer (shared operational fate)
If you run a platform — or some approximation thereof — then you too should practice SRE with your customers to get that increased reliability, prevent your customers from tripping over your changes, and gain better insights into the impact and scope of your failures.

Then, when an incident occurs that causes the service to exceed its error budget — or consumes an unacceptably high proportion of the error budget — the service owner needs to determine:

  1. How much of the error budget did this consume in total?
  2. Why did the incident happen?
  3. What can / should be done to stop it from happening again? 
Answering Question #1 is easy, but the mechanism for evaluating Questions #2 and #3 is a postmortem. If the incident’s root cause was purely on the customer’s side, that’s straightforward — but what if the trigger was an event on your platform side? This is when you should consider an external postmortem.
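To make Question #1 concrete, here's a minimal sketch (our own illustration; the numbers are hypothetical, and in practice the request counts would come from your monitoring system) of how a service owner might compute the fraction of a 30-day error budget consumed by a single incident:

```python
# Sketch: how much of a 30-day error budget did one incident consume?
# The numbers are illustrative; real counts come from your monitoring system.

SLO_TARGET = 0.999              # 99.9% availability over the 30-day window
ERROR_BUDGET = 1 - SLO_TARGET   # 0.1% of requests may fail in the window

def budget_consumed(failed_in_incident, total_requests_in_window):
    """Fraction of the window's error budget used up by one incident."""
    allowed_failures = ERROR_BUDGET * total_requests_in_window
    return failed_in_incident / allowed_failures

# Example: 45,000 failed requests during the incident, 90M requests this month.
print(f"{budget_consumed(45_000, 90_000_000):.0%} of the monthly error budget")
# -> 50% of the monthly error budget
```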

Foundations of an external postmortem


Analyzing outages — and subsequently writing about them in a postmortem — benefits from having a two-way flow of monitoring data between the platform operator and the service owner, which provides an objective measure of the external impact of the incident: When did it start, how long did it last, how severe was it, and what was the total impact on the customer’s error budget? Here on the GCP CRE team, we have found this particularly useful, since it's hard to estimate the impact of problems in lower-level cloud services on end users. We may have observed a 1% error rate and increased latency internally, but was it noticeable externally after traveling through many layers of the stack?

Based on the monitoring data from the service owner and their own monitoring, the platform team can write their postmortem following the standard practices and our postmortem template. This results in an internally reviewed document that has the canonical view of the incident timeline, the scope and magnitude of impact, and a set of prioritized actions to reduce the probability of occurrence of the situation (increased Mean Time Between Failures), reduce the expected impact, improve detection (reduced Mean Time To Detect) and/or recover from the incident more quickly (reduced Mean Time To Recover).

With a shared postmortem, though, this is not the end: we want to expose some — though likely not all — of the postmortem information to the affected customer.

Selecting an audience for your external postmortem


If your customers have defined SLOs, they (and you) know how badly this affected them. Generally, the greater the error budget that has been consumed by the incident, the more interested they are in the details, and the more important it will be to share with them. They're also more likely to be able to give relevant feedback to the postmortem about the scope, timing and impact of the incident, which might not have been apparent immediately after the event.

If your customer’s SLOs weren’t violated but this problem still affected their customers, that’s an action item for the customer’s own postmortem: what changes need to be made to either the SLO or its measurements? For example, was the availability measurement further down in the stack compared to where the actual problem occurred?

If your customer doesn’t have SLOs that represent the end-user experience, it’s difficult to make an objective call about this. Unless there are obvious reasons why the incident disproportionately affected a particular customer, you should probably default to a more generic incident report.

Another factor you should consider is whether the customers with whom you want to share the information are under NDA; if not, this will inevitably severely limit what you're able to share.

If the outage has impacted most of your customers, then you should consider whether the externalized postmortem might be the basis for writing a public postmortem or incident report, like the examples we quoted above. Of course, these are more labor-intensive than external postmortems shared with select customers (i.e., editing the internal postmortem and obtaining internal approvals), but provide additional benefits.

The greatest gain from a fully public postmortem can be to restore trust from your user base. From the point of view of a single user of your platform, it’s easy to feel that their particular problems don’t matter to you. A public postmortem gives them visibility into what happened to their service, why, and how you're trying to prevent it from happening again. It’s also an opportunity for them to conduct their own mini-postmortem based on the information in the public post, asking themselves “If this happened again, how would I detect it and how could I mitigate the effects on my service?”

Deciding how much to share, and why


Another question when writing external postmortems is how deep to get into the weeds of the outage. At one end of the spectrum you might share your entire internal postmortem with a minimum of redaction; at the other you might write a short incident summary. This is a tricky issue that we’ve debated internally.

The two factors we believe to be most important in determining whether to expose the full detail of a postmortem to a customer, rather than just a summary, are:

  1. How important are the details to understanding how to defend against a future recurrence of the event?
  2. How badly did the event damage their service, i.e., how much error budget did it consume? 
As an example, if the customer can see the detailed timeline of the event from the internal postmortem, they may be able to correlate it with signals from their own monitoring and reduce their time-to-detection for future events. Conversely, if the outage only consumed 8% of their 30-day error budget then all the customer wants to know is whether the event is likely to happen more often than once a month.

We have found that, with a combination of automation and practice, we can produce a shareable version of an internal postmortem with about 10% additional work, plus internal review. The downside is that you have to wait for the postmortem to be complete or nearly complete before you start. By contrast, you can write an incident report with a similar amount of effort as soon as the postmortem author is reasonably confident in the root cause.

What to say in a postmortem 


By the time the postmortem is published, the incident has been resolved, and the customer really cares about three questions:

  1. Why did this happen? 
  2. Could it have been worse? 
  3. How can we make sure it won’t happen again?


“Why did this happen?” comes from the “Root causes and Trigger” and “What went wrong” sections of our postmortem template. “Could it have been worse?” comes from “Where we got lucky.” These are two sections which you should do your best to retain as-is in an external postmortem, though you may need to do some rewording for clarity.

“How can we make sure it won’t happen again” will come from the Action items table of the postmortem.

What not to say

With that said, postmortems should never include these three things:

  1. Names of humans - Rather than “John Smith accidentally kicked over a server,” say “a network engineer accidentally kicked over a server.” Internally, we express the involvement of humans in terms of their role rather than their name. This helps us maintain a blameless postmortem culture.
  2. Names of internal systems - The names of your internal systems are not clarifying for your users, and they create a burden on readers to figure out how these things fit together. For example, even though we’ve discussed Chubby externally, we still refer to it in postmortems we make external as “our globally distributed lock system.”
  3. Customer-specific information - The internal version of your postmortem will likely say things like “on XX:XX, Acme Corp filed a support ticket alerting us to a problem.” It’s not your place to share this kind of detail externally, as it may create an undue burden for the reporting company (in this case, Acme Corp). Rather, simply say “on XX:XX, a customer filed…”. If you reference more than one customer, label them Customer A, Customer B, etc.

Other things to watch out for

Another source of difficulty when rewriting a shared postmortem is the “Background” section that sets the scene for the incident. An internal postmortem assumes the reader has basic knowledge of the technical and operational background; this is unlikely to be true for your customer. We try to write the least detailed explanation that still allows the reader to understand why the incident happened; too much detail here is more likely to be off-putting than helpful.

Google SREs are fans of embedding monitoring graphs in postmortems; monitoring data is objective and doesn’t generally lie to you (although our colleague Sebastian Kirsch has some very useful guidance as to when this is not true). When you share a postmortem outside the company, however, be careful what information these graphs reveal about traffic levels and the number of users of a service. Our rule of thumb is to leave the X axis (time) alone, but for the Y axis either remove the labels and quantities altogether, or only show percentages. This is equally true for incorporating customer-generated data in an internal postmortem.
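As a minimal sketch of that rule of thumb (the data and plotting choices here are our own, purely for illustration), you might normalize a traffic series to a percentage of its peak before exporting the graph:

```python
# Sketch: share the shape of a traffic graph without revealing absolute volume.
# The data here is made up; in practice it comes from your monitoring system.
import matplotlib.pyplot as plt

timestamps = list(range(24))                       # hours since incident start
qps = [1200, 1150, 1100, 900, 400, 380, 350, 900,  # absolute values stay internal
       1500, 1800, 2100, 2300, 2250, 2200, 2100, 2000,
       1900, 1850, 1800, 1700, 1600, 1500, 1400, 1300]

peak = max(qps)
percent_of_peak = [100 * q / peak for q in qps]    # only percentages leave the company

plt.plot(timestamps, percent_of_peak)
plt.xlabel("Hours since incident start")           # keep the time axis as-is
plt.ylabel("Traffic (% of peak)")                  # no absolute QPS on the shared graph
plt.savefig("external_postmortem_traffic.png")
```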

A side note on the role of luck


With apologies to Tina Turner, What’s luck got to do, got to do with it? What’s luck but a source of future failures?

As well as “What went well” and “What went badly,” our internal postmortem template includes the section “Where we got lucky.” This is a useful place to tease out risks of future failures that were revealed by an incident. In many cases an incident had less impact than it might have had, because of relatively random factors such as timing, the presence of a particular person as the on-call, or coincidence with another outage that resulted in more active scrutiny of the production systems than normal.

“Where we got lucky” is an opportunity to identify additional action items for the postmortem, e.g.,

  • “the right person was on-call” implies tribal knowledge that needs to be fed into a playbook and exercised in a DiRT test
  • “this other thing (e.g., a batch process or user action) wasn’t happening at the same time” implies that your system may not have sufficient surplus capacity to handle a peak load, and you should consider adding resources
  • “the incident happened during business hours” implies a need for automated alerting and 24-hour pager coverage by an on-call
  • “we were already watching monitoring” implies a need to tune alerting rules to pick up the leading edge of a similar incident if it isn’t being actively inspected.

Sometimes teams also add “Where we got unlucky,” when the incident impact was aggravated by a set of circumstances that are unlikely to recur. Some examples of unlucky behavior are:

  • an outage occurred on your busiest day of the year
  • you had a fix for the problem that hadn't been rolled out for other reasons
  • a weather event caused a power loss.


A major risk in having a “Where we got unlucky” category is that it’s used to label problems that aren’t actually due to blind misfortune. Consider this example from an internal postmortem:

Where we got unlucky

There were various production inconsistencies caused by past outages and experiments. These weren’t cleaned up properly and made it difficult to reason about the state of production.

This should instead be in “What went badly,” because there are clear action items that could remediate this for the future.

When you have these unlucky situations, you should always document them as part of "What went badly," while assessing the likelihood of them happening again and determining what actions you should take. You may choose not to mitigate every risk since you don’t have infinite engineering time, but you should always enumerate and quantify all the risks you can see so that “future you” can revisit your decision as circumstances change.

Summary


Hopefully we've provided a clear motivation for platform and service providers to share their internal postmortems outside the company, at some appropriate level of detail. In the next installment, we'll discuss how to get the greatest benefit out of these postmortems.

Building good SLOs – CRE life lessons



In a previous episode of CRE Life Lessons, we discussed how choosing good service level indicators (SLIs) and service level objectives (SLOs) is critical for defining and measuring the reliability of your service. There’s also a whole chapter in the SRE book about this topic. In this episode, we’re going to get meta and go into more detail about some best practices we use at Google to formulate good SLOs for our SLIs.

SLO musings


SLOs are objectives that your business aspires to meet and intends to take action to defend; just remember, your SLOs are not your SLAs (service level agreements)! You should pick SLOs that represent the most critical aspects of the user experience. If you meet an SLO, your users and your business should be happy. Conversely, if the system does not meet the SLO, that implies there are users who are being made unhappy!

Your business needs to be able to defend an endangered SLO by reducing the frequency of outages, or reducing the impact of outages when they occur. Some ways to do this might include slowing down the rate at which you release new versions, or implementing reliability improvements instead of features. All parts of your business need to acknowledge that these SLOs are valuable and should be defended through trade-offs.

Here are some important things to keep in mind when designing your SLOs:
  • An SLO can be a useful tool for resolving meaningful uncertainty about what a team should be doing. The objective is a line in the sand between "we definitely need to work on this issue" and "we might not need to work on this issue." Therefore, don’t pick SLO targets that are higher than what you actually need, even if you happen to be meeting them now, as that reduces your future flexibility to trade reliability against other things you care about, like development velocity.
  • Group queries into SLOs by user experience, rather than by specific product elements or internal implementation details. For example, direct responses to user action should be grouped into a different SLO than background or ancillary responses (e.g., thumbnails). Similarly, “read” operations (e.g., view product) should be grouped into a different SLO than lower volume but more important “write” ones (e.g., check out). Each SLO will likely have different availability and latency targets.
  • Be explicit about the scope of your SLOs and what they cover (which queries, which data objects) and under what conditions they are offered. Be sure to consider questions like whether or not to count invalid user requests as errors, or what happens when a single client spams you with lots of requests.
  • Finally, though somewhat in tension with the above, keep your SLOs simple and specific. It’s better not to cover non-critical operations with an SLO than to dilute what you really care about. Gain experience with a small set of SLOs; launch and iterate!

Example SLOs


Availability

Here we're trying to answer the question "Was the service available to our user?" Our approach is to count the failures and known missed requests, and report the measurement as a percentage. Record errors from the first point that is in your control (e.g., data from your Load Balancer, not from the browser’s HTTP requests). For requests between microservices, record data from the client side, not the server side.

That leaves us with an SLO of the form:

Availability: <service> will <respond successfully> for <a customer scope> for at least <percentage> of requests in the <SLO period>

For example . . .

Availability: Node.js will respond with a non-503 within 30 seconds for browser pageviews for at least 99.95% of requests in the month.

. . . and . . .

Availability: Node.js will respond with a non-503 within 60 seconds for mobile API calls for at least 99.9% of requests in the month.

For requests that took longer than 30 seconds (60 seconds for mobile), the service might as well have been down, so they count against our availability SLO.
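As an illustration only (not a prescribed implementation), the SLI behind these availability SLOs reduces to counting "good" requests at the load balancer; the field names and deadlines below are assumptions matching the examples above:

```python
# Sketch: availability SLI as "good requests / total requests", measured at the
# load balancer. A request is "good" if it wasn't a 503 and finished within the
# deadline for its traffic class. Field names here are illustrative.

DEADLINES_SECONDS = {"browser": 30, "mobile_api": 60}

def is_good(request):
    deadline = DEADLINES_SECONDS[request["traffic_class"]]
    return request["status"] != 503 and request["latency_s"] <= deadline

def availability(requests, traffic_class):
    relevant = [r for r in requests if r["traffic_class"] == traffic_class]
    good = sum(1 for r in relevant if is_good(r))
    return good / len(relevant)

# e.g., compare against the 99.95% browser target for the month:
# availability(monthly_requests, "browser") >= 0.9995
```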

Latency

Latency is a measure of how well a service performed for our users. We count the number of queries that are slower than a threshold, and report them as a percentage of total queries. The best measurements are done as close to the client as possible, so measure latency at the Load Balancer for incoming web requests, and from the client not the server for requests between microservices.

Latency: <service> will respond within <time limit> for at least <percentage> of requests in the <SLO period>.

For example . . .

Latency: Node.js will respond within 250ms for at least 50% of requests in the month, and within 3000ms for at least 99% of requests in the month.


Percentages are your friend . . .

Note that we expressed our latency SLI as a percentage: “percentage of requests with latency < 3000ms” with a target of 99%, not “99th percentile latency in ms” with a target of “< 3000ms”. This keeps SLOs consistent and easy to understand, because they all have the same unit and the same range. Also, accurately computing percentiles across large data sets is hard, while counting the number of requests below a threshold is easy. You’ll likely want to monitor multiple thresholds (e.g., percentage of requests < 50ms, < 250ms, . . .), but having SLO targets of 99% for one threshold, and possibly 50% for another, is generally sufficient.

Avoid targeting average (mean) latency; it's almost never what you want. Averages can hide outliers, and sufficiently small values are indistinguishable from zero; users will not notice a difference between 50 ms and 250 ms for a full page response time, and thus they should be comparably good. There’s a big difference between an average of 250ms because all requests are taking 250ms, and an average of 250ms because 95% of requests are taking 1ms and 5% of requests are taking 5s.
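A quick back-of-the-envelope check of that last example shows why the mean misleads while a threshold count does not (the numbers are the ones from the paragraph above):

```python
# Two very different latency distributions with (almost) the same mean.
uniform = [250] * 100             # every request takes 250 ms
bimodal = [1] * 95 + [5000] * 5   # 95% take 1 ms, 5% take 5 s

def mean(xs):
    return sum(xs) / len(xs)

def fraction_under(xs, threshold_ms):
    return sum(1 for x in xs if x < threshold_ms) / len(xs)

print(mean(uniform), mean(bimodal))            # 250.0 vs 250.95 ms: nearly identical
print(fraction_under(uniform, 3000))           # 1.00 -> meets a "99% < 3000ms" SLO
print(fraction_under(bimodal, 3000))           # 0.95 -> badly misses the same SLO
```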

. . . except 100%

A target of 100% is impossible over any meaningful length of time. It’s also likely not necessary. SREs use SLOs to embrace risk; the inverse of your SLO target is your error budget, and if your SLO target is 100% that means you have no error budget! In addition, SLOs are a tool for establishing team priorities, dividing top-priority work from work that's prioritized on a case-by-case basis. SLOs tend to lose their credibility if every individual failure is treated as a top priority.

Regardless of the SLO target that you eventually choose, the discussion is likely to be very interesting; be sure to capture the rationale for your chosen target for posterity.

Reporting

Report on your SLOs quarterly, and use quarterly aggregates to guide policies, particularly pager thresholds. Using shorter periods tends to shift focus to smaller, day-to-day issues, and away from the larger, infrequent issues that are more damaging. Any live reporting should use the same sliding window as the quarterly report, to avoid confusion; the published quarterly report is merely a snapshot of the live report.

Example quarterly SLO summary

This is how you might present the historical performance of your service against SLO, e.g., for a semi-annual service report, where the SLO period is one quarter:

SLO                   Target    Q2        Q3
Web Availability      99.95%    99.92%    99.96%
Mobile Availability   99.9%     99.91%    99.97%
Latency ≤ 250ms       50%       74%       70%
Latency ≤ 3000ms      99%       99.4%     98.9%

For SLO-dependent policies such as paging alerts or freezing of releases when you’ve spent the error budget, use a sliding window shorter than a quarter. For example, you might trigger a page if you spent ≥1% of the quarterly error budget over the last four hours, or you might freeze releases if you spent ≥ ⅓ of the quarterly budget in the last 30 days.
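Here's a minimal sketch of how those policies might be expressed; `count_requests` and `count_errors` are placeholders for queries against your monitoring system, and the thresholds are the ones from the example above:

```python
# Sketch: SLO-dependent policies on sliding windows shorter than the quarter.
# count_requests(window) and count_errors(window) are placeholders for queries
# against your monitoring system.
from datetime import timedelta

SLO_TARGET = 0.9995                      # e.g., the web availability target
QUARTER = timedelta(days=90)

def budget_spent(window):
    """Fraction of the *quarterly* error budget spent during `window`."""
    quarterly_budget = (1 - SLO_TARGET) * count_requests(QUARTER)
    return count_errors(window) / quarterly_budget

def should_page():
    # Page if >= 1% of the quarterly budget was spent in the last 4 hours.
    return budget_spent(timedelta(hours=4)) >= 0.01

def should_freeze_releases():
    # Freeze releases if >= 1/3 of the quarterly budget went in the last 30 days.
    return budget_spent(timedelta(days=30)) >= 1 / 3
```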

Breakdowns of SLI performance (by region, by zone, by customer, by specific RPC, etc.) are useful for debugging and possibly for alerting, but aren’t usually necessary in the SLO definition or quarterly summary.

Finally, be mindful about with whom you share your SLOs, especially early on. They can be a very useful tool for communicating expectations about your service, but the more broadly they are exposed the harder it is to change them.

Conclusion


SLOs are a deep topic, but we’re often asked about handy rules of thumb people can use to start reasoning about them. The SRE book has more on the topic, but if you start with these basic guidelines, you’ll be well on your way to avoiding the most common mistakes people make when starting with SLOs. Thanks for reading, we hope this post has been helpful. And as we say here at Google, may the queries flow, your SLOs be met and the pager stay silent!

CRE life lessons: The practicalities of dark launching



In the first part of this series, we introduced you to the concept of dark launches. In a dark launch, you take a copy of your incoming traffic and send it to the new service, then throw away the result. Dark launches are useful when you want to launch a new version of an existing service, but don’t want nasty surprises when you turn it on.

This isn’t always as straightforward as it sounds, however. In this blog post, we’ll look at some of the circumstances that can make things difficult for you, and teach you how to work around them.

Finding a traffic source

Do you actually have existing traffic for your service? If you’re launching a new web service which is not more-or-less-directly replacing an existing service, you may not.

As an example, say you’re an online catalog company that lets users browse items from your physical store’s inventory. The system is working well, but now you want to give users the ability to purchase one of those items. How would you do a dark launch of this feature? How can you approximate real usage when no user is even seeing the option to purchase an item?

One approach is to fire off a dark-launch query to your new component for every user query to the original component. In our example, we might send a background “purchase” request for an item whenever the user sends a “view” request for that item. Realistically, not every user who views an item will go on to purchase it, so we might randomize the dark launch by only sending a “purchase” request for one in every five views.

This will hopefully give you an approximation of live traffic in terms of volume and pattern. Note that this can’t be expected to be totally accurate when it comes to live traffic once the service is launched, but it’s better than nothing.
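A minimal sketch of the "one in every five views" approach above; `render_item_page` and `send_dark_request` are hypothetical helpers standing in for your existing handler and a fire-and-forget call to the new component:

```python
# Sketch: synthesize dark-launch "purchase" traffic from live "view" traffic.
# render_item_page and send_dark_request are hypothetical; the real user only
# ever sees the result of the original "view" handler.
import random

DARK_PURCHASE_FRACTION = 0.2   # roughly one dark purchase per five views

def handle_view(request):
    response = render_item_page(request.item_id)        # existing behavior
    if random.random() < DARK_PURCHASE_FRACTION:
        # The dark response is logged for analysis and then discarded.
        send_dark_request("purchase", item_id=request.item_id,
                          user=request.user_id)
    return response
```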

Dark launching mutating services

Generally, a read-only service is fairly easy to dark-launch. A service with queries that mutate backend storage is far less easy. There are still strong reasons for doing the dark launch in this situation, because it gives you some degree of testing that you can’t reasonably get elsewhere, but you’ll need to invest significant effort to get the most from dark-launching.

Unless you’re doing a storage migration, you’ll need to make significant effort/payoff tradeoffs doing dark launches for mutating queries. The easiest option is to disable the mutates for the dark-launch traffic, returning a dummy response after the mutate is prepared but before it’s sent. This is safe, but it does mean that you’re not getting a full measurement of the dark launched service — what if it has a bug that causes 10% of the mutate requests to be incorrectly specified?

Alternatively, you might choose to send the mutation to a temporary duplicate of your existing storage. This is much better for the fidelity of your test, but great care will be needed to avoid sending real users the response from your temporary duplicate. It would also be very unfortunate for everyone if, at the end of your dark launch, you end up making the new service live when it’s still sending mutations to the temporary duplicate storage.

Storage migration

If you’re doing a storage migration — moving an existing system’s stored data from one storage system to another (for instance, MySQL to MongoDB because you’ve decided that you don’t really need SQL after all) — you’ll find that dark launches will be crucial in this migration, but you’ll have to be particularly careful about how you handle mutation-inducing queries. Eventually you’ll need mutations to take effect in both your old and new storage systems, and then you’ll need to make the new storage system the canonical storage for all user queries.

A good principle is that, during this migration, you should always make sure that you can revert to the old storage system if something goes wrong with the new one. You should know which of your systems (old and new) is the master for a given set of queries, and hence holds the canonical state. The mastership generally needs to be easily mutable and able to revert responsibility to the original storage system without losing data.

The universal requirement for a storage migration is a detailed written plan reviewed by not just your system stakeholders but also by your technical experts from the involved systems. Inevitably, your plan will miss things and will have to adapt as you move through the migration. Moving between storage systems can be an awfully big adventure — expect us to address this in a future blog post.

Duplicate traffic costs

The great thing about a well-implemented dark launch is that it exercises the full service in processing a query, for both the original and new service. The problem this brings is that each query costs twice as much to process. That means you should do the following:


  • Make sure your backends are appropriately provisioned for 2x the current traffic. If you have quota in other teams’ backends, make sure it’s temporarily increased to cover the dark launch as well.
  • If you’re connection-sensitive, ensure that your frontends have sufficient slack to accommodate a 2x connection count.
  • You should already be monitoring latency from your existing frontends, but keep a close eye on this monitoring stat and consider tightening your existing alerting thresholds. As service latency increases, service memory likely also increases, so you’ll want to be alert for either of these stats breaching established limits.


In some cases, the service traffic is so large that a 100% dark launch is not practical. In these instances, we suggest that you determine the largest percentage launch that is practical and plan accordingly, aiming to get the most representative selection of traffic in the dark launch. Within Google, we tend to launch a new service to Googlers first before making the service public. However, experience has taught us that Googlers are often not representative of the rest of the world in how they use a service.

An important consideration if your service makes substantial use of caching is that a sub-50% dark launch is unlikely to see material benefits from caching and hence will probably significantly overstate estimated load at 100%.

You may also choose to test-load your new service at over 100% of current traffic by duplicating some traffic — say, firing off two queries to the new service for every original query. This is fine, but you should scale your quota increases accordingly. If your service is cache-sensitive, then this approach will probably not be useful as your cache hit rate will be artificially high.

Because of the load impact of duplicate traffic, you should carefully consider how to use load shedding in this experiment. In particular, all dark launch traffic should be marked “sheddable” and hence be the first requests to be dropped by your system when under load.
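One way this might look in practice (a sketch only; the header name and helper functions are our own invention, not any particular framework's API):

```python
# Sketch: mark dark-launch requests as sheddable so they're dropped first under
# load. The header name, client helper and server-side check are illustrative.
import requests

CRITICALITY_HEADER = "x-request-criticality"   # illustrative header name

def post_dark_request(url, payload):
    """Client side: tag duplicated dark-launch traffic as sheddable."""
    requests.post(url, json=payload,
                  headers={CRITICALITY_HEADER: "sheddable"}, timeout=1.0)

def should_shed(request_headers, current_load, capacity):
    """Server side: under overload, drop sheddable traffic before live traffic."""
    overloaded = current_load > 0.9 * capacity
    return overloaded and request_headers.get(CRITICALITY_HEADER) == "sheddable"
```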

In any case, if your service on-call sees an unexpected increase in CPU/memory/latency, they should drop the dark launch to 0% and see if that helps.

Summary

If you’re thinking about a dark launch for a new service, consider writing a dark launch plan. In that plan, make sure you answer the following questions:


  • Do you have existing traffic which you can fork and send to your new service?
  • Where will you fork the traffic: the application frontend, or somewhere else?
  • Will you fire off the message to the new backend asynchronously, or will you wait for it and impose a timeout?
  • What will you do with requests that generate mutations?
  • How and where will you log the responses from the original and new services, and how will you compare them?
    • Are you logging the following things: response code, backend latency, and response message size?
    • Will you be diffing responses? Are there fields that cannot meaningfully be diffed which you should skip in your comparison?
  • Have you made sure that your backends can handle 2x the current peak traffic, and have you given them temporary quota for it?
    • If not, at what percentage traffic will you stop the dark launch?
  • How are you going to select traffic for participation in the dark launch percentage: randomly, or by hashing on a key such as user ID? (See the sketch after this list.)
  • Which teams need to know that this dark launch is happening? Do they know how to escalate concerns?
  • What’s your rollback plan after you make your new service live?
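On the question of selecting traffic by hashing a key such as user ID (referenced in the checklist above), here's a minimal illustration of deterministic cohort selection, so the same users stay in the dark launch as you ramp the percentage:

```python
# Sketch: deterministic selection of dark-launch traffic by hashing a stable key
# (here, user ID), so the cohort stays consistent as the percentage ramps up.
import hashlib

def in_dark_launch(user_id: str, percent: float) -> bool:
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # 0.01% granularity
    return bucket < percent * 100                         # percent=5 -> buckets 0..499

# in_dark_launch("user-42", 5) returns the same answer on every request for user-42.
```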


It may be that you don’t have enough surprises or excitement in your life; in that case, you don’t need to worry about dark launches. But if you feel that your service gives you enough adrenaline rushes already, dark launching is a great technique to make service launches really, really boring.

CRE life lessons: What is a dark launch, and what does it do for me?



Say you’re about to launch a new service. You want to make sure it’s ready for the traffic you expect, but you also don’t want to impact real users with any hiccups along the way. How can you find your problem areas before your service goes live? Consider a dark launch.

A dark launch sends a copy of real user-generated traffic to your new service, and discards the result from the new service before it's returned to the user. (Note: We’ve also seen dark launches referred to as “feature toggles,” but this doesn’t generally capture the “dark” or hidden traffic aspect of the launch.)

Dark launches allow you to do two things:

  1. Verify that your new service handles realistic user queries in the same way as the existing service, so you don’t introduce a regression.
  2. Measure how your service performs under realistic load.
Dark launches typically transition gradually from a small percentage of the original traffic to a full (100%) dark launch where all traffic is copied to the new backend, discovering and resolving correctness and scaling issues along the way. If you already have a source of traffic for your new site — for instance, when you’re migrating from an existing frontend to a new frontend — then you’re an excellent candidate for a dark launch.


Where to fork traffic: clients vs. servers

When considering a dark launch, one key question is where the traffic copying/forking should happen. Normally this is the application frontend, i.e. the first service, which (after load balancing) receives the HTTP request from your user and calculates the response. This is the ideal place to do the fork, since it has the lowest friction of change — specifically, in varying the percentage of external traffic sent to the new backend. Being able to quickly push a configuration change to your application frontend that drops the dark launch traffic fraction back down to 0% is an important — though not crucial — requirement of a dark launch process.
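As a minimal sketch of what that fork might look like inside the application frontend (all names are illustrative; `call_original_backend` and `call_new_backend` stand in for your real RPC clients, and the dark fraction would come from a quickly pushable config):

```python
# Sketch: fork traffic in the application frontend. The original backend's
# response is returned to the user; the new backend is called asynchronously
# with a timeout, and its response is only logged, never returned.
import asyncio, logging, random, time

DARK_FRACTION = 0.05      # pushed via config; set to 0.0 to stop the dark launch
DARK_TIMEOUT_S = 0.5

async def handle_request(request):
    response = await call_original_backend(request)      # placeholder RPC client
    if random.random() < DARK_FRACTION:
        # Fire-and-forget: the user never waits on, or sees, the dark response.
        asyncio.create_task(dark_call(request, response))
    return response

async def dark_call(request, original_response):
    start = time.monotonic()
    try:
        dark_response = await asyncio.wait_for(
            call_new_backend(request), timeout=DARK_TIMEOUT_S)
        logging.info(
            "dark_launch status=%s latency_ms=%.1f size=%d matches_original=%s",
            dark_response.status, 1000 * (time.monotonic() - start),
            len(dark_response.body),
            dark_response.status == original_response.status)
    except asyncio.TimeoutError:
        logging.warning("dark_launch timeout after %.1fs", DARK_TIMEOUT_S)
```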

If you don’t want to alter the existing application frontend, you could replace it with a new proxy service which does the traffic forking to both your original and a new version of the application frontend and handles the response diffing. However, this increases the dark launch’s complexity, since you’ll have to juggle load balancing configurations to insert the proxy before the dark launch and remove it afterwards. Your proxy almost certainly needs to have its own monitoring and alerting — all your user traffic will be going through it, and it’s completely new code. What if it breaks?

One alternative is to send traffic at the client level to two different URLs, one for the original service, and the other for the new service. This may be the only practical solution if you’re dark launching an entirely new app frontend and it’s not practical to forward traffic from the existing app frontend — for instance, if you’re planning to move a website from being served by an open-source binary to your own custom application. However, this approach comes with its own set of challenges.



The main risk in client changes is the lack of control over the client’s behavior. If you need to turn down the traffic to the new application, then you’ll at least need to push a configuration update to every affected mobile application. Most mobile applications don’t have a built-in framework for dynamically propagating configuration changes, so in this case you’ll need to make a new release of your mobile app. It also potentially doubles the traffic from mobile apps, which may increase user data consumption.

Another client change risk is that the destination change gets noticed, especially for mobile apps whose teardowns are a regular source of external publicity. Response diffing and logging results is also substantially easier within an application frontend than within a client.


How to measure a dark launch

It’s little use running a dark launch if you’re not actually measuring its effect. Once you’ve got your traffic forked, how do you tell if your new service is actually working? How will you measure its performance under load?

The easiest way is to monitor the load on the new service as the fraction of dark launch traffic ramps up. In effect, it’s a very realistic load test, using live traffic rather than canned traffic. Once you’re at 100% dark launch and have run over a typical load cycle — generally, at least one day — you can be reasonably confident that your server won’t actually fall over when the launch goes live.

If you’re planning a publicity push for your service, you should try to maximize the additional load you put on your service and adjust your launch estimate based on a conservative multiplier. For example, say that you can generate 3 dark launch queries for every live user query without affecting end-user latency. That lets you test how your dark-launched service handles three times the peak traffic. Do note, however, that increasing traffic flow through the system by this amount carries operational risks. There is a danger that your “dark” launch suddenly generates a lot of “light” — specifically, a flickering yellow-orange light which comes from the fire currently burning down your service. If you’re not already talking to your SREs, you need to open a channel to them right now to tell them what you’re planning.

Different services have different peak times. A service that serves worldwide traffic and directly faces users will often peak Monday through Thursday mornings in the US, since US users typically dominate its traffic. By contrast, a service like a photo upload receiver is likely to peak on weekends when users take more photos, and will get huge spikes on major holidays like New Year’s. Your dark launch should try to cover the heaviest live traffic that it’s reasonable to wait for.

We believe that you should always measure service load during a dark launch as it is very representative data for your service and requires near-zero effort to do.

Load is not the only thing you should be looking at, however, as the following measurements should also be considered.


Logging needs

The point where incoming requests are forked to the original and new backends — generally, the application front end — is typically also the point where the responses come back. This is, therefore, a great place to record the responses for later analysis. The new backend results aren’t being returned to the user, so they’re not normally visible directly in monitoring at the application frontend. Instead, the application will want to log these responses internally.

Typically the application will want to log response code (e.g. 20x/40x/50x), latency of the query to the backend, and perhaps the response size, too. It should log this information for both the old and new backends so that the analysis can be a proper comparison. For instance, if the old backend is returning a 40x response for a given request, the new backend should be expected to return the same response, and the logs should enable developers to make this comparison easily and spot discrepancies.

We also strongly recommend that responses from original and new services are logged and compared throughout dark launches. This tells you whether your new service is behaving as you expect with real traffic. If your logging volume is very high, and you choose to use sampling to reduce the impact on performance and cost, make sure that you account in some way for the undetected errors in your traffic that were not included in the logs sample.

Timeouts as a protection

It’s quite possible that the new backend is slower than the original — for some or all traffic. (It may also be quicker, of course, but that’s less interesting.) This slowness can be problematic if the application or client is waiting for both original and new backends to return a response before returning to the client.

The usual approaches are either to make the new backend call asynchronous, or to enforce an appropriately short timeout for the new backend call after which the request is dropped and a timeout logged. The asynchronous approach is preferred, since the latter can negatively impact average and percentile latency for live traffic.

You must set an appropriate timeout for calls to your new service, and you should also make those calls asynchronous from the main user path, as this minimizes the effect of the dark launch on live traffic.


Diffing: What’s changed, and does it matter?

Dark launches where the responses from the old and new services can be explicitly diff’ed produce the most confidence in a new service. This is often not possible with mutations, because you can’t sensibly apply the same mutation twice in parallel; it’s a recipe for conflicts and confusion.

Diffing is nearly the perfect way to ensure that your new backend is drop-in compatible with the original. At Google, it’s generally done at the level of protocol buffer fields. There may be fields where it’s acceptable to tolerate differences, e.g. ordering changes in lists. There’s a trade-off between the additional development work required for a precise meaningful comparison and the reduced launch risk this comparison brings. Alternatively, if you expect a small number of responses to differ, you might give your new service a “diff error budget” within which it must fit before being ready to launch for real.

You should explicitly diff original and new results, particularly those with complex contents, as this can give you confidence that the new service is a drop-in replacement for the old one. In the case of complex responses, we strongly recommend either setting a diff “error budget” (accept up to 1% of responses differing, for instance) or excluding low-information, hard-to-diff fields from comparison.

This is all well and good, but what’s the best way to do this diffing? While you can do the diffing inline in your service, export some stats, and log diffs, this isn't always the best option. It may be better to offload diffing and reporting out of the service that issues the dark launch requests.

Within Google, we have a number of diffing services. Some run batch comparisons, some process data in live streams, others provide a UI for viewing diffs in live traffic. For your own service, work out what you need from your diffing and implement something appropriate.
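For a service without access to those internal tools, a minimal batch-diffing sketch might look like the following; the ignored fields, log format and 1% diff budget are illustrative assumptions:

```python
# Sketch: batch-diff logged responses from the original and new backends,
# ignoring fields that can't meaningfully be compared, and checking the result
# against a diff "error budget". Field names are illustrative.
IGNORED_FIELDS = {"request_id", "served_by", "timestamp"}
DIFF_ERROR_BUDGET = 0.01      # accept up to 1% of responses differing

def significant_diff(old: dict, new: dict) -> bool:
    keys = (set(old) | set(new)) - IGNORED_FIELDS
    return any(old.get(k) != new.get(k) for k in keys)

def diff_rate(paired_responses):
    """paired_responses: iterable of (old_response, new_response) dicts."""
    pairs = list(paired_responses)
    differing = sum(1 for old, new in pairs if significant_diff(old, new))
    return differing / len(pairs)

# Ready to launch (on this axis) if diff_rate(pairs) <= DIFF_ERROR_BUDGET.
```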

Going live

In theory, once you’ve dark-launched 100% of your traffic to the new service, making it go “live” is almost trivial. At the point where the traffic is forked to the original and new service, you’ll return the new service response instead of the original service response. If you have an enforced timeout on the new service, you’ll change that to be a timeout on the old service. Job done! Now you can disable monitoring of your original service, turn it off, reclaim its compute resources, and delete it from your source code repository. (A team meal celebrating the turn-down is optional, but strongly recommended.) Every service running in production is a tax on support and reliability, and reducing the service count by turning off a service is at least as important as adding a new service.

Unfortunately, life is seldom that simple. (As id Software’s John Cash once noted, “I want to move to ‘theory,’ everything works there.”) At the very least, you’ll need to keep your old service running and receiving traffic for several weeks in case you run across a bug in the new service. If things start to break in your new service, your reflexive action should be to make the original service the definitive request handler because you know it works. Then you can debug the problem with your new service under less time pressure.

The process of switching services may also be more complex than we’ve suggested above. In our next blog post, we’ll dig into some of the plumbing issues that increase the transition complexity and risk.

Summary

Hopefully you’ll agree that dark launching is a valuable tool to have when launching a new service on existing traffic, and that managing it doesn’t have to be hard. In the second part of this series, we’ll look at some of the cases that make dark launching a little more difficult to arrange, and teach you how to work around them.

Making the most of an SRE service takeover – CRE life lessons



In Part 2 of this blog post we explained what an SRE team would want to learn about a service angling for SRE support, and what kind of improvements they want to see in the service before considering it for take-over. And in Part 1, we looked at why an SRE team would or wouldn’t choose to onboard a new application. Now, let’s look at what happens once the SREs agree to take on the pager.

Onboarding preparation

If a service entrance review determines that the service is suitable for SRE support, developers and the SRE team move into the “onboarding” phase, where they prepare for SREs to support the service.

While developers address the action items, the SRE team starts to familiarize itself with the service, building up service knowledge and familiarity with the existing monitoring tools, alerts and crisis procedures. This can be accomplished through several methods:
  • Education: present the new service to the rest of the team through tech talks, discussion sessions and "wheel of misfortune" scenarios.
  • “Take the pager for a spin”: share pager alerts with the developers for a week, and assess each page on the axes of criticality (does this indicate a user-impacting problem with the service?) and actionability (is there a clear path for the on-call to resolve the underlying issue?). This gives the SRE team a quantitative measure of how much operational load the service is likely to impose.
  • On-call shadow: page the primary on-call developer and SRE at the same time. At this stage, responsibility for dealing with emergencies rests on the developer, but the developer and the SRE collaborate on debugging and resolving production issues together.

Measuring success


Q: I’ve gone through a lot of effort to make my service ready to hand over to SRE. How can I tell whether it was a good expenditure of scarce engineering time?

If the developer and SRE teams have agreed to hand over a system, they should also agree on criteria (including a timeframe) to measure whether the handover was successful. Such criteria may include (with appropriate numbers):
  • Absolute decrease of paging/outages count
  • Decreasing paging/outages as a proportion of (increasing) service scale and complexity.
  • Reduced time/toil from the point of new code passing tests to being deployed globally, and a flat (or decreasing) rollback rate.
  • Increased utilization of reserved resources (CPU, memory, disk etc.)
Setting these criteria can then prepare the ground for future handover proposals; if the success criteria for a previous handover were not met, the teams should carefully reconsider how this will change the handover plans for a new service.

Taking over the pager


Once all the blocking action items have been resolved, it’s time for SREs to take over the service pager. This should be a "no drama" event, with few, well-documented service alerts that can be easily resolved by following procedures in the service playbook.

In theory, the SRE team will have identified most of these issues in the entrance review phase, but realistically there are many issues that only become apparent with sustained exposure to a service.

In the medium term (one to two months), SREs should build a list of deficiencies or areas for optimization in the system with regard to monitoring, resource consumption etc. This hitlist should primarily aim to reduce SRE “toil” (manual, repetitive, tactical work that has no enduring value), and secondarily fix aspects of the system, e.g., resource consumption or cruft accumulation, which can impact system performance. Tertiary changes may include things like updating the documentation to facilitate onboarding new SREs for system support.

In the long term (three to six months), SREs should expect to meet most or all of the pre-established measurements for takeover success as described above.

Q: That’s great, so now my developers can turn off their pager?

Not so fast, my friend. Although the SRE team has learned a lot about the service in the preceding months, they're still not experts; there will inevitably be failure modes involving arcane service behavior where the SRE on-call will not know what has broken, or how to fix it. There's no substitute for having a developer available, and we normally require developers to keep their on-call rotation so that the SRE on-call can page them if needed. We expect this to be a low rate of pages.

The nuclear option — handing back the pager


Not all SRE takeovers go smoothly, and even if the SREs have taken over the pager for a service, it’s possible for reliability to regress or operational load to increase. This might be for good reasons such as a “success disaster”  a sustained and unexpected spike in usage  or for bad reasons such as poor QA testing.

An SRE team can only handle so many services, and if one service starts to consume a disproportionate amount of SRE time, it's at risk of crowding out other services. In this case, the SRE team should proactively tell the developer team that they have a problem, and should do so in a neutral way that’s data-heavy:

In the past month we’ve seen 100 pages/week for service S1, compared to a steady rate of 20-30 pages/week in the months before. Even though S1 is within SLO, the pages are dominating our operational work and crowding out service improvement work. You need to do one of the following:
  1. bring S1’s paging rate down to the original rate by reducing S1’s rate of change
  2. de-tune S1’s alerts so that most of them no longer page
  3. tell us to drop SRE support for services S2, S3 so our overall paging rate remains steady
  4. tell us to drop SRE support for S1
This lets the developer team decide what’s most important to them, rather than the SRE team imposing a solution.

There are also times when developers and SREs agree that handing back the pager to developers is the right thing to do, even if the operational load is normal. For example, imagine SREs are supporting a service, and developers come up with a new, shiny, higher-performing version. Developers support the new version initially, while working out its kinks, and migrate more and more users to it. Eventually the new version is the most heavily used; this is when SREs should take on the pager for the new service and hand the old service’s pager back to developers. Developers can then finish user migrations and turn down the old service at their convenience.

Converging your SRE and dev teams


Onboarding a service is about more than transferring responsibility from developers to SREs; it also improves mutual understanding between the two teams. The dev team gets to know what the SRE team does, and why, who the individual SREs are, and perhaps how they got that way. Similarly, the SRE team gains a better understanding of the development team’s work and concerns. This increase in empathy is a Good Thing in itself, but is also an opportunity to improve future applications.

Now, when a developer team designs a new application or service, they should take the opportunity to invite the SRE team to the discussion. SRE teams can easily spot reliability issues in the design, and advise developers on ways to make the service easier to operate, set up good monitoring and configure sensible rollout policies from the start.

Similarly, when the SREs do future planning or design new tooling, they should include developers in the discussions; developers can advise them on future launches and projects, and give feedback on making the tools easier to operate or a better fit for developers’ needs.

Imagine that there was a brick wall between the SRE and developer teams; our original plan for service takeover was to throw the service over the wall and hope. Over the course of these blog posts, we’ve shown you how to make a hole in the wall so there can be two-way communication as the service is passed through, then expand it into a doorway so that SREs can come into the developers’ backyard and vice versa. Eventually, developers and SREs should tear down the wall entirely, and replace it with a low hedge and ornamental garden arch. SREs and developers should be able to see what’s going on in each others’ yard, and wander over to the other side as needed.


Summary


When an SRE team takes on pager responsibility for your developer-supported service, don’t just throw it over the fence into their yard. Work with the SRE team to help them understand how the service works and how it breaks, and to find ways to make it more resilient and easier to support. Make sure that supporting your service is a good use of the SRE team’s time, making use of their particular skills. With a carefully planned handover process, you can both be confident that the queries will flow and your pagers will be (mostly) silent.

How SREs find the landmines in a service – CRE life lessons



In Part 1 of this blog post we looked at why an SRE team would or wouldn’t choose to onboard a new application. In this installment, we assume that the service would benefit substantially from SRE support, and look at what needs to be done for SREs to onboard it with confidence.

Onboarding review


Q: We have a new application that would make sense for SRE to support. Do I just throw it over the wall and tell the SRE team “Here you are; you’re on call for this now, best of luck”?

That’s a great approach, if your goal is failure. At first, your developer team’s assessment of how important the application is to support (and whether it’s in a supportable state) is likely to be rather different from your SRE team’s assessment, and arbitrarily imposing support for a service onto an SRE team is unlikely to work. Think about it: you haven’t convinced them that the service is a good use of their time yet, and human nature is that people don’t enthusiastically embrace doing something they don’t really believe in, so they're unlikely to be active participants in making the service materially more reliable.

At Google, we’ve found that to successfully onboard a service into SRE, the service owner and SRE team must agree on a process for the SRE team to understand and assess the service, and identify critical issues to be resolved upfront. (Incidentally, we follow a similar process when deciding whether or not to onboard a Google Cloud customer’s application into our Customer Reliability Engineering program.) We typically split this into two phases:

  • SRE entrance review: where an SRE team assesses whether a developer-supported service should be onboarded by SRE, and what the onboarding preconditions should be.
  • SRE onboarding/takeover: where a dev and SRE team agree in principle that the SRE team should take on primary operational responsibility for a service, and start negotiating the exact conditions for takeover (how and when the SREs will onboard the service).

It’s important to remember the motivations of the various parties in this process:

  • Developers want someone else to pick up support for the service, and make it run as well as possible. They want users to feel that the service is working properly, otherwise they'll move to a service run by someone else.
  • The SRE team wants to be sure that they're not being “sold a pup” with a hard-to-support service, and have a vision for making the production service lower in toil and more robust.
  • Meanwhile the company management wants to reduce the number of embarrassing service outages, as long as it doesn’t cost them too much in engineer time.

The SRE entrance review

During an SRE entrance review (SER), also referred to as a Production Readiness Review (PRR), the SRE team takes the measure of a service currently running in production. The purpose of an SER is to:

  1. Assess how the service would benefit from SRE ownership
  2. Identify service design, implementation and operational deficiencies that could be a barrier to SRE takeover
  3. If SRE ownership is determined to be beneficial, identify the bug fixes, process changes and service behavior changes needed before onboarding the service

An SRE team typically designates a single person or a small subset of the team to familiarize themselves with the service, and evaluate it for fitness for takeover.

The SRE looks at the service as-is: its performance, monitoring, associated operational processes and recent outage history, and asks themselves: “If I were on-call for this service right now, what are the problems I’d want to fix?” They might be visible problems, such as too many pages happening per day, or potential problems such as a dependency on a single machine that will inevitably fail some day.

A critical part of any SRE analysis is the service’s Service Level Objectives (SLOs), and associated Service Level Indicators (SLIs). SREs assume that if a service is meeting its SLOs then paging alerts should be rare or non-existent; conversely, if the service is in danger of falling out of SLO then paging alerts are loud and actionable. If these expectations don’t match reality, the SRE team will focus on changing either the SLO definitions or the SLO measurements.
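To make the link between SLOs and paging concrete, here's a minimal sketch of error-budget-based paging, assuming your monitoring system can return good and total event counts for a recent window; the window, the burn-rate threshold and the function names are illustrative assumptions, not a prescription.

```python
# Minimal sketch of SLO-based paging, assuming monitoring can report
# good/total event counts for a recent window. Thresholds are illustrative.

SLO_TARGET = 0.999       # 99.9% availability objective
FAST_BURN_RATE = 14.4    # ~2% of a 30-day error budget consumed in one hour

def burn_rate(good: int, total: int) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0
    observed = 1 - good / total
    allowed = 1 - SLO_TARGET
    return observed / allowed

def should_page(good: int, total: int) -> bool:
    """Page only when the error budget is burning fast enough to threaten
    the SLO; slower burns become tickets rather than pages."""
    return burn_rate(good, total) >= FAST_BURN_RATE

# 10,000 requests in the last hour, 150 failed: budget burning 15x too fast.
print(should_page(good=9850, total=10000))  # True
```

If paging fires when this kind of check says the SLO is safe (or stays silent when it isn't), that mismatch is exactly what the review should surface.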

In the review phase, SREs aim to understand:

  • what the service does
  • day-to-day service operation (traffic variation, releases, experiment management, config pushes)
  • how the service tends to break and how this manifests in alerts
  • rough edges in monitoring and alerting
  • where the service configuration diverges from the SRE team’s practices
  • major operational risks for the service


The SRE team also considers:

  • whether the service follows SRE team best practices, and if not, how to retrofit it
  • how to integrate the service with the SRE team’s existing tools and processes
  • the desired engagement model and separation of responsibilities between the SRE team and the SWE team. When debugging a critical production problem, at what point should the SRE on-call page the developer on-call?


The SRE takeover


The SRE entrance review typically produces a prioritized list of issues with the service that need to be fixed. Most will be assigned to the development team, but the SRE team may be better suited for others. In addition, not all issues are blockers to SRE takeover (there might be design or architectural changes that SREs recommend for service robustness that could take many months to implement).

There are four main axes of improvement for a service in an onboarding process: extant bugs, reliability, automation and monitoring/alerting. On each axis there will be issues which will have to be solved before takeover (“blockers”), and others which would be beneficial to solve but not critical.

Extant bugs
The primary source of issues blocking SRE takeover tends to be action items from the service’s previous postmortems. The SRE team expects to read recent postmortems and verify that a) the proposed actions to resolve the outage root causes are what they’d expect and b) those actions are actually complete. Further, the absence of recent postmortems is a red flag for many SRE teams.
Reliability
Some reliability-related change requests might not directly block SRE takeover: many reliability improvements involve design changes, significant code changes, new back-end integrations or migration off a deprecated infrastructure component, and target the longer-term evolution of the system towards greater reliability.

The reliability-related changes that block takeover would be those which mitigate or remove issues which are known to cause significant downtime, or mitigate risks which are expected to cause an outage in the future.

Automation
This is a key concern for SREs considering takeover of a service: how much manual work needs to be done to "operate" the service on a week-to-week basis, including configuration pushes, binary releases and similar time sinks.

The best way to find out what would be most useful to automate is for SREs to get practical experience of the developers' world: shadow the developer team through a typical week and get a feel for the routine manual work their on-call actually involves.

If there’s excessive manual work involved in supporting a service, automation usually solves the problem.

Monitoring/alerting
The dominant concern with most services undergoing SRE takeover is the paging rate: how many times the service wakes up the on-call engineer. At Google, we adhere to the "Treynor Maximum" of an average of two incidents per 12-hour shift (for an on-call team as a whole). Thus, an SRE team looks at the average incident load of a new service over the past month or so to see how it fits with their current incident load.

Generally, excessive paging rates are the result of one of three things:

  1. Paging on something that's not intrinsically important, e.g., a task restart or a disk hitting 80% capacity. Instead, downgrade the page to a bug (if it's not urgent) or eliminate it entirely. Moving to symptom-based monitoring ("users are actually seeing problems") can help improve this situation.
  2. Page storms where one small incident/outage generates many pages. Try to group related pages for an incident into a single outage, to get a clearer picture of the system’s outage metrics.
  3. A system that’s having too many genuine problems. In this case SRE takeover in the near future is unlikely, but SREs may be able to help diagnose and resolve the root causes of the problems.
SREs generally want to see several weeks of low paging levels before agreeing to take over a service.

More general ways to improve the service might include:

  • integrating the service with standard SRE tools and practices, e.g., load shedding, release processes and configuration pushes
  • extending and improving playbook entries to rely less on the developer team’s tribal knowledge
  • aligning the service’s configurations with the SRE team’s common languages and infrastructure
Ultimately, an SRE entrance review should produce guidance that's useful to the developers even if the SRE team declines to onboard the service; the review's recommendations should still help them make the service easier to operate and more reliable.

Smoothing the path


SREs need to understand the developers’ service, but SREs and developers also need to understand each other. If the developer team has not worked with SREs before, it can be useful for SREs to give “lightning” talks to the developers on SRE topics such as monitoring, canarying, rollouts and data integrity. This gives the developers a better idea of why the SREs are asking particular questions and pushing particular concerns.

One of Google’s SREs found that it was useful to “pretend that I am a dev team novice, and have the developer take me through the codebase, explain the history, show me where the main() function is, and so on.”

Similarly, SREs should understand the developers’ point of view and experience. During the SER, at least one SRE should sit with the developers, attend their weekly meetings and stand-ups, informally shadow their on-call and help out with day-to-day work to get a “big picture” view of the service and how it runs. It also helps remove distance between the two teams. Our experience has been that this is so positive in improving the developer-SRE relationship that the practice tends to continue even after the SER has finished.

Last but not least, the SRE entrance review document should also state clearly whether the service merits SRE takeover, and why or why not.

At this point, the developer team and SRE team both understand what needs to be done to make the service suitable for SRE takeover, if that is indeed feasible at all. In Part 3 of this blog post, we'll look at how to proceed with a service takeover so that both teams benefit from the process.

Know thy enemy: how to prioritize and communicate risks – CRE life lessons



Editor’s note: We’ve spent a lot of time in CRE Life Lessons talking about how to identify and mitigate risks in your system. In this post, we’re going to talk about how to effectively communicate and stack-rank those risks.

When a Google Cloud customer engages with Customer Reliability Engineering (CRE), one of the first things we do is an Application Reliability Review (ARR). First, we try to understand your application’s goals: what it provides to users and the associated service level objectives (SLOs) (or we help you create SLOs if you do not have any!). Second, we evaluate your application and operations to identify risks that threaten your ability to reach your SLOs. For each identified risk, we provide a recommendation on how to eliminate or mitigate it based on our experiences at Google.

The number of risks identified for each application varies greatly depending on the maturity of your application and team and target level for reliability or performance. But whether we identify five risks or 50, two fundamental facts remain true: Some risks are worse than others, and you have a finite amount of engineering time to address them. You need a process to communicate the relative importance of the risks and to provide guidance on which risks should be addressed first. This appears easy, but beware! The human brain is notoriously unreliable at comparing and evaluating risks.

This post explains how we developed a method for analyzing risks during an ARR, allowing us to present our customers with a clear, ranked list of recommendations, explain why one risk is ranked above another, and describe the impact a risk may have on the application’s SLO target. By the end of this post, you’ll understand how to apply this to your own application, even without going through a CRE engagement.

Take one: the risk matrix

Each risk has many properties that can be used to evaluate its relative importance. In discussions internally and with customers, two properties in particular stand out as most relevant:
  • The likelihood of the risk occurring in a given time period.
  • The impact that would be felt if the risk materializes.
We began by defining three levels for each property, which are represented in the following 3x3 table.

Example risks for each combination of likelihood (rows) and impact (columns):

  • Frequent / Catastrophic: Overload results in slow or dropped requests during the peak hour each day.
  • Frequent / Damaging: The wrong server is turned off and requests are dropped.
  • Frequent / Minimal: Restarts for weekly upgrades drop in-progress requests (i.e., no lame ducking).
  • Common / Catastrophic: A bad release takes the entire service down. Rollback is not tested.
  • Common / Damaging: Users report an outage before monitoring and alerting notifies the operator.
  • Common / Minimal: A daylight savings bug drops requests.
  • Rare / Catastrophic: There is a physical failure in the hosting location that requires complete restoration from a backup or disaster recovery plan.
  • Rare / Damaging: Overload results in a cascading failure. Manual intervention is required to halt or fix the issue.
  • Rare / Minimal: A leap year bug causes all servers to restart and drop requests.
We tested this approach with a couple of customers by bucketing the risks we had identified into the table. This isn't a novel approach: we quickly realized that our terminology and format are the same as those of a risk matrix, a commonly used management tool in the risk assessment field. This realization seemed to confirm that we were on the right track, and had created something that customers and their management could easily understand.

We were right: Our customers told us that the table of risks was a good overview and was easy to grasp. However, we struggled to explain the relative importance of entries in the list based on the cells in the table:
  • The distribution of risks across the cells was extremely uneven. Most risks ended up in the "common, damaging" cell, which doesn't help to explain the relative importance of the items within that cell.
  • Assigning a risk to a cell (and its subsequent position in the list of risks) is subjective and depends on the reliability target of the application. For example, the “frequent, catastrophic” example of dropping traffic for a few minutes during a release is catastrophic at four nines, but less so at two nines.
  • Ordering the cells into a ranking is not straightforward. Is it more important to handle a “rare, catastrophic” risk, or a “frequent, minimal” risk? The answer is not clear from the names or definitions of the categories alone. Further, the desired order can change from matrix to matrix depending on the number of items in each cell.

Risk expressed as expected losses

As we showed in the previous section, the traditional risk matrix does a poor job of explaining the relative importance of each risk. However, the risk assessment field offers another useful model: using impact and likelihood to calculate the expected loss from a risk. Expressed as a numeric quantity, this expected loss value is a great way to explain the relative importance of our list of risks.

How do we convert qualitative concepts of impact and likelihood to quantified values that we can use to calculate expected loss? Consider our earlier posts on availability and SLOs, specifically, the concepts of Mean Time Between Failure (MTBF), Mean Time To Recover (MTTR), and error budget. The MTBF of a risk provides a measure of likelihood (i.e., how long it takes for the risk to cause a failure), the MTTR provides a measure of impact (i.e., how long we expect the failure to last before recovering), and the error budget is the expected number of downtime minutes per year that you're willing to allow (a.k.a. accepted loss).

Now with this system, when we work through an ARR and catalog risks, we use our experience and judgement to estimate each risk’s MTBF (counted in days) and the subsequent MTTR (counted in minutes out of SLO). Using these two values, we estimate the expected loss in minutes for each risk over a fixed period of time, and generate the desired ranking.

We found that calculating expected losses over a year is a useful timeframe for risk-ranking, and developed a three-colour traffic light system to provide high-level guidance and quick visual feedback on the magnitude of each risk vs. the error budget:
  • Red: This risk is unacceptable, as it falls above the acceptable error budget for a single risk (we typically use 25%), and therefore, can have a major impact on your reliability in a single event.
  • Amber: This risk should not be acceptable, as it’s a major consumer of your error budget and therefore, needs to be addressed. You may be able to accept some amber risks by addressing some less urgent (green) risks to buy back budget.
  • Green: This is an acceptable risk. It's not a major consumer of your error budget, and in aggregate, does not cause your application to exceed the error budget. You don't have to address green risks, but may wish to do so to give yourself more budget to cover unexpected risks, or to accept amber risks that are hard to mitigate or eliminate.
Based on the three-colour traffic light system, the following table demonstrates how we rank and colour the risks given a 3-nines availability target. The risks are a combination of those in the original matrix and some additional examples to help illustrate the amber category. You can refer to the spreadsheet linked at the end of this post to see the precise MTTR and MTBF numbers that underlie this table, along with additional examples of amber risks.
Risk (expected bad minutes per year):
  • Overload results in slow or dropped requests during the peak hour each day: 3559
  • A bad release takes the entire service down. Rollback is not tested: 507
  • Users report an outage before monitoring and alerting notifies the operator: 395
  • There is a physical failure in the hosting location that requires complete restoration from a backup or disaster recovery plan: 242
  • The wrong server is turned off and requests are dropped: 213
  • Overload results in a cascading failure. Manual intervention is required to halt or fix the issue: 150
  • Operator accidentally deletes the database; restore from backup is required: 129
  • Unnoticed growth in usage triggers overload; the service collapses: 125
  • A configuration mishap reduces capacity, causing overload and dropped requests: 122
  • A new release breaks a small set of requests; not detected for a day: 119
  • Operator is slow to debug and root-cause the bug due to noisy alerting: 76
  • A daylight savings bug drops requests: 71
  • Restarts for weekly upgrades drop in-progress requests (i.e., no lame ducking): 52
  • A leap year bug causes all servers to restart and drop requests: 16

Other considerations

The ranked list of risks is extremely useful for communicating the findings of an ARR and conveying the relative magnitude of the risks compared to each other. We recommend that you use the list only for this purpose. Do not prioritize your engineering work directly based on the list. Instead, use the expected loss values as inputs to your overall business planning process, taking into consideration remediation and opportunity costs to prioritize work.

Also, don't be tricked into thinking that because you have concrete numbers for the expected loss, they are precise! They're only as good as the MTBF and MTTR estimates they're derived from. In the best case, MTBF and MTTR are averages from observed data; more commonly, they're estimates based purely on intuition and experience. To minimize introducing errors into the final ranking, we recommend estimating MTBF and MTTR to within an order of magnitude rather than using specific, potentially inaccurate values.

Somewhat in contrast to the advice just mentioned, we find it useful to introduce additional granularity into the calculation of MTBF and MTTR values, for more accurate estimates. First, we split MTTR into two components:
  • Mean Time To Detect (MTTD): The time between when the risk first manifests and when the issue is brought to the attention of someone (or something) capable of remediating it.
  • Mean Time To Repair (MTTR): Redefined to mean the time between when the issue is brought to the attention of someone capable of remediating it and when it is actually remediated.
This granularity is driven by the realization that the time to notice an issue and the time to fix it often differ significantly. Specifying the two figures separately makes it easier to assess them and to keep estimates consistent across risks.

Second, in addition to considering MTTD, we also factor in what proportion of users are affected by a risk (e.g., in a sharded system, shards fail at a given rate and incur downtime before a failover succeeds, but each failure only affects a fraction of the users). Taking these two refinements into account, our overall formula for calculating the expected annual loss from a risk is:

(MTTD + MTTR) * (365.25 / MTBF) * percent of affected users

To implement this method for your own application, here is a spreadsheet template that you can copy and populate with your own data: https://goo.gl/bnsPj7
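If you'd like to experiment without the spreadsheet, here's a small illustrative sketch of the same calculation for a 3-nines target; the example risk, the amber threshold and the helper names are assumptions for this sketch, not the exact values behind the table above or the linked template.

```python
# Illustrative implementation of the expected-loss formula for 3 nines.
# The amber threshold and the example risk are assumptions for this sketch.

ANNUAL_BUDGET_MIN = (1 - 0.999) * 365.25 * 24 * 60   # ~526 bad minutes/year
RED_SHARE = 0.25    # single-risk budget share treated as unacceptable above
AMBER_SHARE = 0.10  # assumed threshold for a "major consumer" of budget

def expected_annual_loss(mttd_min, mttr_min, mtbf_days, fraction_affected):
    """(MTTD + MTTR) * (365.25 / MTBF) * proportion of affected users."""
    return (mttd_min + mttr_min) * (365.25 / mtbf_days) * fraction_affected

def traffic_light(loss_min: float) -> str:
    if loss_min > RED_SHARE * ANNUAL_BUDGET_MIN:
        return "red"
    if loss_min > AMBER_SHARE * ANNUAL_BUDGET_MIN:
        return "amber"
    return "green"

# Example risk: noticed in 15 minutes, fixed in 45, strikes every 90 days,
# affects all users.
loss = expected_annual_loss(mttd_min=15, mttr_min=45, mtbf_days=90,
                            fraction_affected=1.0)
print(f"{loss:.0f} bad minutes/year -> {traffic_light(loss)}")  # ~243 -> red
```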

Summary

When analyzing the reliability of an application, it is easy to generate a large list of potential risks that must be prioritized for remediation. We have demonstrated how the MTBF and MTTR values of each risk can be used to develop a prioritized list of risks based on the expected impact on the annual error budget.

We here in CRE have found this method to be extremely helpful. In addition, customers can use the expected loss figure as an input to more comprehensive risk assessments, or cost/benefit calculations of future engineering work. We hope you find it helpful too!

How release canaries can save your bacon – CRE life lessons



The first part of any reliable software release is being able to roll back if something goes wrong; we discussed how we do this at Google in last week’s post, Reliable releases and rollbacks. Once you have that under your belt, you’ll want to understand how to detect that things are starting to go wrong in the first place, with canarying.
Photo taken by David Carroll
The concept of canarying first emerged in 1913, when physiologist John Scott Haldane took a caged bird down into a coal mine to detect carbon monoxide. This fragile bird is more susceptible to the odorless gas than humans, and quickly falls off its perch in its presence, signaling to the miners that it's time to get out!

In software, the canary is usually the first instance to receive live production traffic for a new rollout, whether a binary or a configuration change. The new release goes only to the canary at first. The fact that the canary handles real user traffic is key: if it breaks, real users are affected, so canarying should be the first step in your deployment process, as opposed to the last step in testing.

The first step in implementing canarying is a manual process where release engineers trigger the new binary release to the canary instance(s). They then monitor the canary for any signs of increased errors, latency and load. If everything looks good, they then trigger a release to the rest of the production instances.

We here on Google’s SRE teams have found over time that manual inspection of monitoring graphs isn’t sufficiently reliable to detect performance problems or rises in error rates of a new release. When most releases work well, the release engineer gets used to seeing no problems and so, when a low-level problem appears, tends to implicitly rationalize the monitoring anomalies as “noise.” We have several internal postmortems on bad releases whose root cause boils down to “the canary graph wasn’t wiggly enough to make the release engineer concerned.”

We've moved towards automated analysis, where our canary rollout service measures the canary tasks to detect elevated errors, latency and load automatically, and rolls back automatically. (Of course, this only works if rollbacks are safe!)
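As a rough illustration of what automated canary analysis can look like, here's a sketch that compares a canary against the rest of production (the "control"); the thresholds, metric names and verdict values are assumptions for illustration, not the behavior of Google's canary service.

```python
# Sketch of automated canary analysis: compare the canary against the rest
# of production ("control"). Thresholds and metric names are assumptions.

MIN_CANARY_REQUESTS = 1000     # don't judge the canary on too little traffic
ERROR_RATE_TOLERANCE = 2.0     # canary error rate at most 2x the control's
LATENCY_TOLERANCE = 1.3        # canary p99 latency at most 30% above control

def canary_verdict(canary: dict, control: dict) -> str:
    """Return 'wait', 'fail' or 'pass' from request/error/latency counters."""
    if canary["requests"] < MIN_CANARY_REQUESTS:
        return "wait"                       # not enough signal yet
    canary_err = canary["errors"] / canary["requests"]
    control_err = max(control["errors"] / control["requests"], 1e-6)
    if canary_err / control_err > ERROR_RATE_TOLERANCE:
        return "fail"                       # elevated errors: roll back
    if canary["p99_latency_ms"] > LATENCY_TOLERANCE * control["p99_latency_ms"]:
        return "fail"                       # elevated latency: roll back
    return "pass"                           # promote to the next rollout step

canary = {"requests": 5000, "errors": 40, "p99_latency_ms": 350}
control = {"requests": 500000, "errors": 500, "p99_latency_ms": 300}
print(canary_verdict(canary, control))      # "fail": 0.8% vs 0.1% errors
```

A "fail" verdict triggers an automated rollback; a "pass" lets the rollout proceed to its next percentage step.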

Likewise, if you implement canaries as part of your releases, take care to make it easy to see problems with a release. Consider very carefully how you implement fault tolerance in your canary tasks; it’s fine for the canary to do the best it can with a query, but if it starts to see errors either internally or from its dependency services then it should “squawk loudly” by manifesting those problems in your monitoring. (There’s a good reason why the Welsh miners didn’t breed canaries to be resistant to toxic gases, or put little gas masks on them.)

Client canarying

If you’re doing releases of client software, you should have a mechanism for canarying new versions of the client, and you'll need to answer the following questions:
  1. How will you deploy the new version to only a small percentage of users?
  2. How will you detect if the new version is crash-looping, dropping traffic or showing users errors? (“What's the monitoring sound of no queries happening?”)
A solution for question 2 is for clients to identify themselves to your backend service, ideally by including in each request information about the client's operating system and application version ID, and for the server to log this information. If you can make the clients identify themselves specifically as canaries, so much the better; this lets you export their stats to a different set of monitoring metrics. To detect that clients are failing to send queries, you'll generally need to know the lowest plausible amount of incoming traffic at any given time of the day or week, and trigger an alert if inbound traffic drops below that amount.
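Here's a minimal sketch of the server side of that idea, assuming clients send a version string in a request header that your server logs and exports as a per-version metric; the header name, counters and thresholds are invented for illustration.

```python
# Sketch of the server side: count requests by a client-reported version
# header, and alert when canary traffic goes quieter than plausible.
# Header name, metric shape and thresholds are assumptions.

from collections import Counter

requests_by_version = Counter()   # in practice, a labeled monitoring metric

def record_request(headers: dict) -> None:
    # Clients identify themselves, e.g. "myapp-android/3.2.1-canary".
    requests_by_version[headers.get("X-Client-Version", "unknown")] += 1

def canary_gone_quiet(observed: int, lowest_plausible: int) -> bool:
    """'The monitoring sound of no queries happening': alert when canary
    traffic in the window falls below the lowest plausible level for this
    time of day or week."""
    return observed < lowest_plausible

record_request({"X-Client-Version": "myapp-android/3.2.1-canary"})
print(canary_gone_quiet(observed=12, lowest_plausible=50))   # True: page someone
```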

Typically, alerting rules for canaries for high-availability systems use a longer evaluation duration (how long you listen to the monitoring signals before deciding you have a problem) than for the main system because the much smaller traffic amount makes the standard signal much noisier; a relatively innocuous problem such as a few service instances being restarted can briefly push the canary error rate above the regular alarm threshold.

Your release should normally aim to cover a wide range of user types but a small fraction of active users. For Android clients, the Google Play Store allows you to deploy a new version of your application package file (APK) to an (essentially random) fraction of users; you can do this on a country-by-country basis. However, see the discussion on Android APK releases below for the limitations and risks in this approach.

Web clients

If your end users access your service via desktop or mobile web rather than an application, you tend to have better control of what’s being executed.

Regular web clients whose UI is managed by JavaScript are fairly easy to control, in that you have the potential to deliver updated JavaScript resources every time a page loads. However, if you cache JavaScript and similar resources client-side, which is useful for reducing service load as well as user latency and bandwidth consumption, it's hard to roll back a bad change. As we discussed in our last post, anything that gets in the way of easy and quick rollbacks is going to be a problem.

One solution is to version your JavaScript files (first release in a /v1/ directory, the second in /v2/, and so on). The rollout then simply consists of changing the resource links in your root pages to reference the new (or old) versions.
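As a hypothetical sketch, the root page can be rendered with the asset version read from a single config value, so a rollback is one config change rather than a fight with client caches; the template and config mechanism here are assumptions.

```python
# Hypothetical sketch: the root page references versioned assets, so rolling
# forward or back is just a config change, not a cache-busting exercise.

ASSET_VERSION = "v2"   # set from release config; roll back by setting "v1"

def render_root_page() -> str:
    return (
        "<!doctype html><html><head>"
        f'<script src="/static/{ASSET_VERSION}/app.js" defer></script>'
        "</head><body>...</body></html>"
    )

print(render_root_page())
```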

Android APK releases

New versions of an Android app can be rolled out to a percentage of current users using staged rollouts in the Play Store. This lets you try out a new release of an app on a small subset of your current users; once you have confidence in that release, you can roll it out to more users, and so on.

The staged release mechanism marks a percentage of users as eligible to pick up the new release. When their mobile device next checks in to the Play Store for updates, it sees an available update for the app and starts the update process.

There can be problems with this approach though:
  • You have no control over when eligible-for-update users will actually check in; normally it’ll be within 24 hours, assuming they have adequate connectivity, but this may not be true for users in countries where cellular and Wi-Fi data services are slow and expensive per-byte.
  • You have no control over whether users will accept your update on their mobile device, which can be a particular issue if the new release requires additional permissions.
Following the canarying process described above, you can determine whether your new client release has a problem once the canary's active user base grows enough for the characteristics of the new traffic to become clear: Is there a higher error rate? Is latency rising? Has traffic to your server mysteriously increased sharply?

If you have a known bad release of your app at version v, the most expedient fix (given the inability to roll back) might be to build your version v-1 code branch into release v+1 and release that, stepping up quickly to 100%. That removes the time pressure from fixing the problems detected in the code.

Release percentage steps

When you perform a gradual release of a new binary or app, you need to decide in what percentage increments to release your application, and when to trigger the next step in a release. Consider:
  1. The first (canary) step should generate enough traffic for any problems to be clear in your monitoring or logging; normally somewhere between 1% and 10% depending on the size of your user base.
  2. Each step involves significant manual work and delays the overall release. If you step by 3% per day, it will take you a month to do a complete release.
  3. Going up by a single large increment (say, 10% to 100%) can reveal dramatic traffic problems that weren’t apparent at much smaller traffic levels: try not to increase your upgraded user base by more than 2x per step if this is a risk.
  4. If a new version is good, you generally want most of your users to pick it up quickly. If you're doing a rollback, you want to ramp up to 100% much faster than for a new release.
  5. Traffic patterns are often diurnal (typically highest during the daytime), so you may need at least 24 hours to see the peak traffic load after a release.
  6. In the case of mobile apps, you'll also need to allow time for the users to pick up and start using the new release after they’ve been enabled for it.
If you're looking to roll out an Android app update to most of your users within a few days, you might choose a Play Store staged update starting with a 10% rollout that then increases to 50% and finally 100%. Plan for at least 24 hours between release stages, and check your monitoring and logging before each next step. This way, a large fraction of your user base picks up the new release within 72 hours of the initial release, and it's possible to detect most problems before they become too big to handle. For launches where you know there's a risk of a significant traffic increase to a service, choose steps of 10%, 25%, 50% and 100%, or even more fine-grained increases.

For internal binary releases where you update your service instances directly, you might instead choose steps of 1%, 10%, then 100%. The 1% release lets you see if there's any gross error in the new release, e.g., if 90% of responses are errors. The 10% release lets you pick up errors or latency increases that are an order of magnitude smaller, and detect any gross performance differences. The third step is normally a complete release. For performance-sensitive systems (generally, those operating at 75%+ of capacity), consider adding a 50% step to catch more subtle performance regressions. The higher the target reliability of a system, the longer you should let each step "bake" to detect problems.
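Putting these steps together, a rollout driver might look something like the sketch below; the stage percentages, bake time and callback functions are assumptions, and a real pipeline would gate each step on checks like the canary verdict sketched earlier.

```python
# Sketch of a staged-rollout driver. Stage percentages, bake time and the
# callback functions are assumptions for illustration.

import time

STAGES = (1, 10, 50, 100)     # percent of instances/users per step
BAKE_SECONDS = 24 * 3600      # long enough to observe a full diurnal peak

def staged_rollout(set_rollout_percent, release_looks_healthy, rollback) -> bool:
    for percent in STAGES:
        set_rollout_percent(percent)
        time.sleep(BAKE_SECONDS)          # let the step bake before judging it
        if not release_looks_healthy():
            rollback()                    # roll back first, investigate second
            return False
    return True                           # fully released
```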

If an ideal marketing launch sequence is 0-100 (everyone gets the new features at once), and the ideal reliability engineer launch sequence is 0-0 (no change means no problems), the “right” launch sequence for an app is inevitably a matter of negotiation. Hopefully the considerations described here give you a principled way to determine a mutually acceptable rollout. The graph below shows you how these various strategies might play out over an 8-day release window.

Summary

In short, we here at Google have developed a software release philosophy that works well for us, for a variety of scenarios:
  • “Rollback early, rollback often.” Try to move your service towards this philosophy, and you’ll reduce the Mean Time To Recover of your service.
  • “Canary your rollouts.” No matter how good your testing and QA, you'll find that your binary releases occasionally have problems with live traffic. An effective canarying strategy and good monitoring can reduce the Mean Time To Detect these problems, and dramatically reduce the number of affected users.
At the end of the day, though, perhaps the best kind of launch is one where the features launched can be enabled independent of the binary rollout. That’s a blog post for another day.

Reliable releases and rollbacks – CRE life lessons



Editor’s note: One of the most common causes of service outages is releasing a new version of the service binaries; no matter how good your testing and QA might be, some bugs only surface when the affected code is running in production. Over the years, Google Site Reliability Engineering has seen many outages caused by releases, and now assumes that every new release may contain one or more bugs.

As software engineers, we all like to add new features to our services; but every release comes with the risk of something breaking. Even assuming that we are appropriately diligent in adding unit and functional tests to cover our changes, and undertaking load testing to determine if there are any material effects on system performance, live traffic has a way of surprising us. These are rarely pleasant surprises.

The release of a new binary is a common source of outages. From the point of view of the engineers responsible for the system’s reliability, that translates to three basic tasks:
  1. Detecting when a new release is actually broken;
  2. Moving users safely from a bad release to a “hopefully” fixed release; and
  3. Preventing too many clients from suffering through a bad release in the first place (“canarying”).
For the purpose of this analysis, we’ll assume that you are running many instances of your service on machines or VMs behind a load balancer such as nginx, and that upgrading your service to use a new binary will involve stopping and starting each service instance.

We'll also assume that you monitor your system with something like Stackdriver, measuring internal traffic and error rates. If you don't have this kind of monitoring in place, then it's difficult to meaningfully discuss reliability; per the Hierarchy of Reliability described in the SRE Book, monitoring is the most fundamental requirement for a reliable system.

Detection

The best case for a bad release is that, when a service instance is restarted with it, a major fraction of requests are handled improperly, generating errors such as HTTP 502 or much higher response latencies than normal. In this case, your overall service error rate rises quickly as the rollout progresses through your service instances, and you realize that your release has a problem.

A more subtle case is when the new binary returns errors on a relatively small fraction of queries - say, user setting change requests, or only for users whose name contains an apostrophe (for good or bad reasons). With this failure mode, the problem may only become manifest in your overall monitoring once the majority of your service instances have been upgraded. For this reason, it can be useful to have error and latency summaries for your service instances broken down by binary release version.
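One way to get that breakdown is sketched below, under the assumption that each request log entry (or metric label) records the binary version that served it; the data shapes are invented for the example.

```python
# Sketch of a per-release breakdown, assuming each log entry (or metric
# label) records the binary version that served the request.

from collections import defaultdict

def error_rate_by_version(request_logs):
    """request_logs: iterable of (binary_version, was_error) pairs."""
    totals, errors = defaultdict(int), defaultdict(int)
    for version, was_error in request_logs:
        totals[version] += 1
        errors[version] += int(was_error)
    return {v: errors[v] / totals[v] for v in totals}

# A 0.1% global error rate looks like noise; a 5% error rate on the new
# version, visible only in the breakdown, does not.
logs = [("v41", False)] * 9800 + [("v42", False)] * 190 + [("v42", True)] * 10
print(error_rate_by_version(logs))   # {'v41': 0.0, 'v42': 0.05}
```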

Rollbacks

Before you plan to roll out a new binary or image to your service, you should ask yourself, "What will I do if I discover a catastrophic / debilitating / annoying bug in this release?" Not because it might happen, but because sooner or later it is going to happen, and it is better to have a well-thought-out plan in place instead of trying to make one up when your service is on fire.

The temptation for many bugs, particularly if they are not show-stoppers, is to build a quick patch and then “roll forward,” i.e., make a new release that consists of the original release plus the minimal code change necessary to fix the bug (a “cherry-pick” of the fix). We don’t generally recommend this though, especially if the bug in question is user-visible or causing significant problems internally (e.g., doubling the resource cost of queries).

What’s wrong with rolling forward? Put yourself in the shoes of the software developer: your manager is bouncing up and down next to your desk, blood pressure visibly climbing, demanding to know when your fix is going to be released because she has your company’s product director bending her ear about all the negative user feedback he’s getting. You’re coding the fix as fast as humanly possible, because for every minute it’s down another thousand users will see errors in the service. Under this kind of pressure, coding, testing or deployment mistakes are almost inevitable.

We have seen this at Google any number of times, where a hastily deployed roll-forward fix either fails to fix the original problem, or indeed makes things worse. Even if it fixes the problem it may then uncover other latent bugs in the system; you’re taking yourself further from a known-good state, into the wilds of a release that hasn’t been subject to the regular strenuous QA testing.

At Google, our philosophy is that “rollbacks are normal.” When an error is found or reasonably suspected in a new release, the releasing team rolls back first and investigates the problem second. A request for a rollback is not interpreted as an attack on the releasing team, or even the person who wrote the code containing the bug; rather, it is understood as The Right Thing To Do to make the system as reliable as possible for the user. No-one will ask “why did you roll back this change?” as long as the rollback changelist describes the problem that was seen.

Thus, for rollbacks to work, the implicit assumption is that they are:

  1. easy to perform; and
  2. trusted to be low-risk.

How do we make the latter true?

Testing rollbacks

If you haven’t rolled back in a few weeks, you should do a rollback “just because”; aim to find any traps with incompatible versions, broken automation/testing etc. If the rollback works, just roll forward again once you’ve checked out all your logs and monitoring. If it breaks, roll forward to remove the breakage and then focus all your efforts on diagnosing the cause of the rollback breakage. It is better by far to detect this when your new release is working well, rather than being forced off a release that is on fire and having to fight to get back to your known-good original release.

Incompatible changes

Inevitably, there are going to be times when a rollback is not straightforward. One example is when the new release requires a schema change to an in-app database (such as a new column). The danger is that you release the new binary, upgrade the database schema, and then find a problem with the binary that necessitates rollback. This leaves you with a binary that doesn’t expect the new schema, and hasn’t been tested with it.

The approach we recommend here is a feature-free release; starting from version v of your binary, build a new version v+1 which is identical to v except that it can safely handle the new database schema. The new features that make use of the new schema are in version v+2. Your rollout plan is now:
  1. Release binary v+1
  2. Upgrade database schema
  3. Release binary v+2
Now, if there are any problems with either of the new binaries then you can roll back to a previous version without having to also roll back the schema.
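As a hypothetical illustration of what the feature-free v+1 binary looks like in code (the table and column names are invented for the example):

```python
# Hypothetical illustration of the feature-free v+1 binary: it tolerates the
# new column without depending on it, so binaries and schema can be rolled
# back independently. Table and column names are invented.

def load_user_row(row: dict) -> dict:
    """v+1 read path: works against both the old and the new schema."""
    return {
        "id": row["id"],
        "email": row["email"],
        # Column added by the pending schema migration; default it rather
        # than assume it exists, so v+1 also runs against the old schema.
        "preferred_locale": row.get("preferred_locale", "en"),
    }

print(load_user_row({"id": 1, "email": "a@example.com"}))          # old schema
print(load_user_row({"id": 2, "email": "b@example.com",
                     "preferred_locale": "fr"}))                   # new schema
```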

This is a special case of a more general problem. When you build the dependency graph of your service and identify all its direct dependencies, you need to plan for the situation where any one of your dependencies is suddenly rolled back by its owners. If your launch is waiting for a dependency service S to move from release r to r+1, you have to be sure that S is going to “stick” at r+1. One approach here is to make an ecosystem assumption that any service could be rolled back by one version, in which case your service would wait for S to reach version r+2 before your service moved to a version depending on a feature in r+1.

Summary

We've learned that there's no good rollout unless you have a corresponding rollback ready to go, but how can we know when to roll back without having our entire service burned to the ground by a bad release?

In part 2 we’ll look at the strategy of “canarying” to detect real production problems without risking the bulk of your production traffic on a new release.

Reimagining support for Google Cloud Platform: new pricing model and partnerships



San Francisco: Cloud customers need a flexible and responsive relationship with their providers. We've taken a close look at how we offer support, and today we're announcing a new model, as well as partnerships with Pivotal and Rackspace, to deliver a closer and more effective way to engage with customers.

When you move to the cloud you’re choosing to bet your business on that platform. The way we see it, this relationship is more than just a typical ‘customer/vendor’ dynamic; it’s a partnership, especially when it comes to support. We started down this path last year when we launched Customer Reliability Engineering (CRE) based on the principles of Site Reliability Engineering (SRE), but we always knew that it was just the first step.

Engineering Support: new role-based model


Support is part of the overall product experience, and we think it should be tailored to your unique technical needs. We're announcing the all-new Engineering Support, a role-based subscription model that allows us to match engineer to engineer, so we can meet you where your business is, no matter what stage of development you're in.

We know you don’t have just one project or team. We know you're constantly shifting engineers around as software development projects move from concept to development to production. With this in mind, we took a look at support and asked: Why should you be forced to pick just one support plan for your whole company? Why should you be locked into paying for a multi-year support contract? Why should you pay more for support as you spend more on the platform? If we’re doing our jobs, then you should spend less over time.

In rethinking our approach, and after listening to the feedback from customers, we focused on three principles:

  • Predictability: flat fees per user per month, with no variable percentage-of-platform-spend charges. You should know on day one what your costs will be on day 30.
  • Customizability: you should be able to configure your support entitlements to match the exact needs of your business.
  • Flexibility: you should be able to change the level of support from month to month as the unique needs of your business evolve.

With Engineering Support, we’ll offer three choices per support seat:

  • Development engineering support is ideal for developers or QA engineers who can manage with a response within four to eight business hours, priced at $100/user per month.
  • Production engineering support provides a one-hour response time for critical issues at $250/user per month.
  • On-call engineering support pages a Google engineer and delivers a 15-minute response time 24x7 for critical issues at $1,500/user per month.

With this new model, you pay only for the roles your team needs, and you can decide which support response times best suit the lifecycle stages of your applications and who in your organization needs to interact with support.

The advantages of the Engineering Support model are:
  • You can mix and match your support levels and spend to the stages of development maturity for your projects. You can add, remove or change support levels monthly, from our Cloud Console. No more buying the highest tier for the whole company just because one project needs a 15-minute response time.
  • Prices are fixed so you know on the first day of the month what your support bill will be on the last day of the month. No more of the dreaded “success tax” where your support bill increases with cloud usage.
  • You can make adjustments month-to-month as your needs evolve, changing your support needs with shifts in your business.

Our goal is for Engineering Support to eventually replace the “precious metal” (e.g., silver, gold) tiers that link support costs directly to cloud usage. We're aiming to roll out Engineering Support with new customers this spring. We’ll also work with our existing customers to move them to the new model over the course of the year.

And, there's an option to not pay for support at all. As always, we support every customer for quota increase requests and billing questions at no cost. Further, forums, communities, issue lists and product documentation are accessible by all, 24x7.

Watch for more info soon at cloud.google.com/support.

Pivotal Cloud Foundry first CRE partner

We support our customers’ needs to use the mix of cloud services that is best for their business. And since we launched the CRE program, we’ve known that it’s important to have an ecosystem of technology partners to extend this mix. So we’re excited to announce that we’ve brought on Pivotal as our first CRE technology partner.

CRE technology partners will work hand-in-hand with Google to thoroughly review their solutions and implement changes to address identified risks to reliability. Additionally, we'll continuously work with our CRE partners to ensure that the partner solution’s configuration, architecture and conventions allow users to create highly reliable and available applications that are in line with CRE’s best practices.

We’re working closely with Pivotal to make sure that GCP customers who choose to use Pivotal Cloud Foundry can feel comfortable that they’re going to build and deploy highly reliable systems by default.

Rackspace Support for GCP

To ensure our customers are covered beyond our own support teams, we’re partnering with Rackspace to offer managed support for GCP. The Rackspace team is actively building their GCP practice, and we are providing them with the resources and tools they need to deliver the best experience to our joint customers.

Our goal is to make GCP the best choice for Rackspace customers looking to move to the cloud by creating a highly integrated experience. Rackspace already has Google Certified Professionals on staff, and expects to begin onboarding beta customers for GCP managed support in the coming months.

Visit us at Next

If you're at Google Cloud Next '17 this week, please stop by The Fifth Nine lounge to learn more about SRE and DevOps. We’re hosting some great sessions including SRE War Stories, where panelists will discuss memorable SRE events in Google history, talks by customers and partners, and a fireside chat with SRE creators Benjamin Treynor Sloss and Ben Lutch. See you there!