Getting the most out of shared postmortems — CRE life lessons

By Adrian Hilton and Gwendolyn Stockman, Customer Reliability Engineers, and Dave Rensin, Director of Customer Reliability Engineering

In our previous post we discussed the benefits of sharing internal postmortems outside your company. You may adopt a one-to-many approach with an incident summary that tells all your customers what happened and how you'll prevent it from happening again. Or, if the incident impacted a major customer, you may share something close to your original postmortem with them.

In this post, we consider how to review a postmortem with your affected customer(s), both to get better actionable data and to help customers improve their systems and practices. We also present a worked example of a shared postmortem based on the SRE Book postmortem template.

Postmortems should fix your customer too

How to get outages to benefit everyone

Even if the fault was 100% on you, the platform side, an external postmortem can still help customers improve their reliability. Now that we know what happens when a particular failure occurs, how can we generalize this to help the customer mitigate the impact, and reduce MTTD and MTTR for a similar incident in the future?

One of the best sources of data for any postmortem is your customers’ SLOs, with their ability to measure the impact of a platform outage. Our CRE team talks about SLOs quite a lot in the CRE Life Lessons series, and there’s a reason why: SLOs and error budgets inform more than just whether to release features in your software.

For customers with defined SLOs who suffered a significant error budget impact, we recommend conducting a postmortem review with them. The review is partly to ensure that the customer’s concerns were addressed, but also to identify “what went wrong” and “where we got lucky,” and to agree on actions that would address these for the customer.
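To make "significant error budget impact" concrete, here is a minimal Python sketch of how a customer might quantify it. It assumes a request-based availability SLO; the function names and the 50% review threshold are illustrative assumptions, not a rule from this post.

```python
def error_budget_consumed(slo_target: float, total_requests: int,
                          failed_requests: int) -> float:
    """Return the fraction of the period's error budget used by failures.

    Assumes a request-based SLO: slo_target is the fraction of requests
    that must succeed over the measurement period (e.g. 0.999).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The error budget is the number of requests that are allowed to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    return failed_requests / allowed_failures


def needs_postmortem_review(slo_target: float, total_requests: int,
                            failed_requests: int,
                            threshold: float = 0.5) -> bool:
    """Flag an incident for joint review once it uses `threshold` of the budget.

    The 0.5 default is a hypothetical cut-off chosen for illustration.
    """
    consumed = error_budget_consumed(slo_target, total_requests, failed_requests)
    return consumed >= threshold
```

With a 99.9% SLO over 1,000,000 requests, the budget is about 1,000 failed requests, so an incident that fails 600 requests uses roughly 60% of the budget and would be flagged under this assumed threshold.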

For example, suppose the platform’s storage service suffered increased latency for a certain class of objects in one region. This is not the customer’s fault, but they may still be able to do something about it.

The internal postmortem might read something like:

What went well

The shared monitoring implemented with CustomerName showed a clear single-region latency hit which resulted in a quick escalation to storage oncall.

What went wrong

A new release of the storage frontend suffered from a performance regression for uncached reads that was not detected during testing or rollout.

Where we got lucky

Only reads of objects between 4KB and 32KB in size were materially affected.

Action items

Add explicit read/write latency tests for both cached and uncached objects, in buckets of 1KB, 4KB, 32KB, …
Have paging alerts for latency over SLO limits, aggregated by Cloud region, for both cached and uncached objects, in buckets of 1KB, 4KB, 32KB, …
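The alerting action item above could be prototyped roughly as follows. This is a sketch under stated assumptions: the sample tuple layout, the nearest-rank 95th percentile, and the treatment of 1KB, 4KB and 32KB as bucket upper bounds are all illustrative choices. The source elides the remaining bucket sizes, so larger objects fall into a single catch-all bucket here.

```python
import math
from collections import defaultdict

# Bucket upper bounds in bytes. Only the three sizes named in the action
# item are used; the rest of the list is elided in the source.
SIZE_BUCKETS = [1024, 4 * 1024, 32 * 1024]


def size_bucket(size_bytes: int) -> str:
    """Label an object size with the smallest bucket that contains it."""
    for bound in SIZE_BUCKETS:
        if size_bytes <= bound:
            return f"<={bound // 1024}KB"
    return f">{SIZE_BUCKETS[-1] // 1024}KB"


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def breached_groups(samples, limit_ms, pct=95):
    """Return the (region, cached, bucket) groups whose latency exceeds the limit.

    `samples` is an iterable of (region, cached, size_bytes, latency_ms)
    tuples; this layout is hypothetical.
    """
    groups = defaultdict(list)
    for region, cached, size_bytes, latency_ms in samples:
        groups[(region, cached, size_bucket(size_bytes))].append(latency_ms)
    return sorted(
        key for key, latencies in groups.items()
        if percentile(latencies, pct) > limit_ms
    )
```

Grouping by region, cache state and size bucket matters here because the regression in this example only affected uncached reads of mid-sized objects in one region; a single global latency average could hide it.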

When a customer writes their own postmortem about this incident, using the shared postmortem to understand better what broke in the platform and when, that postmortem might look like:

What went well

We had anticipated a generic single-region platform failure and had the capability to fail over out of an affected region.

What went wrong

Although the latency increase was detected quickly, we didn’t have accessible through-stack monitoring that could show us that it was coming from platform storage-service rather than our own backends.
Our decision to fail out of the affected region took nearly 30 minutes to complete because we had not practiced it in a year and our playbook instructions were out of date.

Where we got lucky

This happened during business hours so our development team was on hand to help diagnose the cause.

Action items

Add explicit dashboard monitoring for aggregate read and write latency to and from platform storage-service.
Run periodic (at least once per quarter) test failovers out of a region to validate that the failover instructions still work and increase ops team confidence with the process.
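The quarterly failover action item is easy to let slip, so a customer might automate a staleness check. The following is a minimal sketch; the function name and the 91-day approximation of a quarter are assumptions for illustration.

```python
from datetime import date, timedelta

# Roughly one quarter. The exact figure is an assumption; the action item
# only says "at least once per quarter".
MAX_DRILL_AGE = timedelta(days=91)


def drill_overdue(last_drill: date, today: date,
                  max_age: timedelta = MAX_DRILL_AGE) -> bool:
    """Return True if the last test failover is older than the allowed age."""
    return today - last_drill > max_age
```

A check like this could run from a scheduled job and file a ticket when it returns True, so that the playbook is exercised before the next real incident rather than during it.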

Prioritize and track your action items

A postmortem isn’t complete until the root causes have been fixed

Sharing the current status of your postmortem action items is tricky. It's unlikely that the customer will be using the same issue tracking system as you are, so neither side will have a “live” view of which action items from a postmortem have been resolved, and which are still open. Within Google we have automation which tracks this and “reminds” us of unclosed critical actions from postmortems, but customers can’t see those unless we surface them in the externally-visible part of our issue tracking system, which is not our normal practice.

Currently, we hold a monthly SLO review with each customer, where we list the major incidents along with the postmortem or incident report for each one. We use that occasion to report the status of open critical bugs from previous months’ incidents, and to check how the customer is doing on their own action items.
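When neither side can see the other's issue tracker, a shared list of action items can stand in for a live view at the monthly review. The sketch below is not the internal automation mentioned above; the `ActionItem` fields and the `open_critical_items` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ActionItem:
    incident: str                  # which incident the item came from
    owner: str                     # e.g. "platform" or "customer"
    summary: str
    critical: bool
    opened: date
    closed: Optional[date] = None  # None while the item is still open


def open_critical_items(items: List[ActionItem],
                        today: date) -> List[Tuple[ActionItem, int]]:
    """Return open critical items, oldest first, paired with their age in days."""
    pending = [item for item in items if item.critical and item.closed is None]
    pending.sort(key=lambda item: item.opened)
    return [(item, (today - item.opened).days) for item in pending]
```

Listing the oldest items first puts the longest-outstanding fixes at the top of the review agenda, for both the platform's items and the customer's.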

Other benefits

Opening up is an opportunity

There are practical reliability benefits of sharing postmortems, but there are other benefits too. Customers who are evolving towards an SRE culture and adopting blameless postmortems can use the external postmortem as a model for their own internal write-ups. We’re the first to admit that it’s really hard to write your own first postmortem from scratch; having a collection of “known-good” postmortems as a reference can be very helpful.

At a higher level, shared postmortems give your customer a “glimpse behind the curtain.” When a customer moves from on-premises hardware to the cloud, it can be frightening; they're giving up a lot of control of and visibility into the platform on which their service runs. The cloud is expected to encapsulate the operational details of the services it offers, but unfortunately it can be guilty of hiding information that the customer really wants to see. A detailed external postmortem makes that information visible, giving the customer a timeline and deeper detail, which hopefully they can relate to.

Joint postmortems

If you want joint operations, you need joint postmortems

The final step in the path to shared postmortems is creating a joint postmortem. Until this point, we’ve discussed how to externalize an existing document, where the action items, for example, are written by you and assigned to you. With some customers, however, it makes sense to do a joint postmortem where you both contribute to all sections of the document. It will reflect not only your thoughts from the event, but also the customer’s thoughts and reactions. It will even include action items that you assign to your customer, and vice-versa!

Of course, you can’t do joint postmortems with large numbers of your customers, but doing so with at least a few of them helps you (a) build shared SRE culture, and (b) keep the customer perspective in your debugging, design and planning work.

Joint postmortems are also one of the most effective tools you have to persuade your product teams to re-prioritize items on their roadmap, because they present a clear end-user story of how those items can prevent or mitigate future outages.

Summary

Sharing your postmortems with your customers is not an easy thing to do; however, we have found that it helps you:

Gain a better understanding of the impact and consequences of your outages
Increase the reliability of your customers’ service
Give customers confidence in continuing to run on your platform even after an outage

To get you started, here's an example of an external postmortem for the aforementioned storage frontend outage, using the SRE Book postmortem template. (Note: Text relating to the customer (“JaneCorp”) is marked in purple for clarity.) We hope it sets you on the path to learning and growing from your outages. Happy shared postmortem writing!
