Available . . . or not? That is the question – CRE life lessons



In our last installment of the CRE life lessons series, we discussed how to survive a "success disaster" with load-shedding techniques. We got a lot of great feedback from that post, including several questions about how to tie measurements to business objectives. So, in this post, we decided to go back to first principles, and investigate what “success” means in the first place, and how to know if your system is “succeeding” at all.

A prerequisite to success is availability. A system that's unavailable cannot perform its function and will fail by default. But what is "availability"? We must define our terms:

Availability defines whether a system is able to fulfill its intended function at a point in time. In addition to being used as a reporting tool, the historical availability measurement can also describe the probability that your system will perform as expected in the future. Sometimes availability is measured by using a count of requests rather than time directly. In either case, the structure of the formula is the same: successful units / total units. For example, you might measure uptime / (uptime + downtime), or successful requests / (successful requests + failed requests). Regardless of the particular unit used, the result is a percentage like 99.9% or 99.999%, sometimes referred to as “three nines” or “five nines.”
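
Either formulation reduces to the same calculation. Here's a minimal sketch in Python (the function name is ours, purely for illustration):

  def availability(successful_units, total_units):
      """Availability = successful units / total units, regardless of the unit."""
      return successful_units / total_units

  # Time-based: 43,156.8 minutes of uptime out of 43,200 minutes in a 30-day month.
  print(availability(43_156.8, 43_200))   # ~0.999 -> "three nines"

  # Request-based: 99,900 successful requests out of 100,000 total.
  print(availability(99_900, 100_000))    # 0.999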

Achieving high availability is best approached by focusing on the unsuccessful component (e.g., downtime or failed requests). Taking a time-based availability metric as an example: given a fixed period of time (e.g., 30 days, or 43,200 minutes) and an availability target of 99.9% (three nines), simple arithmetic shows that the system must not be down for more than 43.2 minutes over those 30 days. This 43.2-minute figure provides a very concrete target to plan around, and is often referred to as the error budget. If you exceed 43.2 minutes of downtime over 30 days, you won't meet your availability goal.
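
The arithmetic behind that figure, as a small sketch (the helper function is illustrative, not from the post):

  def error_budget_minutes(availability_target, period_minutes):
      """Allowed downtime over a period, given an availability target."""
      return (1 - availability_target) * period_minutes

  print(error_budget_minutes(0.999, 30 * 24 * 60))   # ~43.2 minutes over a 30-day month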

Two further concepts are often used to help understand and plan the error budget:

Mean Time Between Failures (MTBF): total uptime / # of failures. This is the average time between failures.

Mean Time to Repair (MTTR): total downtime / # of failures. This is the average time taken to recover from a failure.

These metrics can be computed historically (e.g., over the past 3 months, or the past year) and combined as (Total Period / MTBF) * MTTR to give an expected downtime value. Continuing with the above example, if the historical MTBF is calculated to be 10 days, and the historical MTTR is calculated to be 20 minutes, then you would expect to see 60 minutes of downtime ((30 days / 10 days) * 20 minutes), clearly outside the 43.2-minute error budget for a three-nines availability target. Meeting the target would require increasing the MTBF (say, to a failure every 20 days) or decreasing the MTTR (say, to 10 minutes), or a combination of both.
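
The same calculation as a sketch, using the example's figures (the function name is ours):

  def expected_downtime_minutes(period_days, mtbf_days, mttr_minutes):
      """Expected downtime over a period: (Total Period / MTBF) * MTTR."""
      return (period_days / mtbf_days) * mttr_minutes

  print(expected_downtime_minutes(30, 10, 20))   # 60 minutes: over the 43.2-minute budget
  print(expected_downtime_minutes(30, 20, 20))   # 30 minutes: MTBF doubled to 20 days
  print(expected_downtime_minutes(30, 10, 10))   # 30 minutes: MTTR halved to 10 minutes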

Keeping the concepts of error budget, MTBF and MTTR in mind when defining an availability target helps to provide justification for why the target is set where it is. Rather than simply describing the target as a fixed number of nines, it's possible to relate the numeric target to the user experience in terms of total allowable downtime, frequency and duration of failure.

Next, we'll look at how to ensure this focus on user experience is maintained when measuring availability.


Measuring availability


How do you know whether a system is available? Consider a fictitious "Shakespeare" service, which allows users to find mentions of a particular word or phrase in Shakespeare’s texts. This is a canonical example, used frequently within Google for training purposes, and mentioned throughout the SRE book.

Let's try applying the scientific method to determine the availability of the hypothetical Shakespeare system.
  1. Question: how often is the system available?
  2. Observation: when you visit shakespeare.com, you normally get back the "200 OK" status code and an HTML blob. Very rarely, you see a 500 Internal Server Error or a connection failure.
  3. Hypothesis: if "availability" is the percentage of requests per day that return 200 OK, the system will be 99.9% available.
  4. Measure: "tail" the response logs of the Shakespeare service’s web servers and dump them into a logs-processing system.
  5. Analyze: take a daily availability measurement as the percentage of 200 OK responses out of the total number of requests (sketched in code after this list).
  6. Interpret: after seven days of data, the lowest daily availability observed is 99.7%.
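
A minimal sketch of the analysis in step 5, assuming each log record has been parsed into a dict with a status field (the field name and data shape are assumptions):

  from collections import Counter

  def daily_availability(log_lines):
      """Fraction of requests that returned 200 OK, per the hypothesis above."""
      statuses = Counter(line["status"] for line in log_lines)
      total = sum(statuses.values())
      return statuses[200] / total if total else 0.0

  # One day's worth of (synthetic) log records: 997 "200 OK"s and three 500s.
  logs = [{"status": 200}] * 997 + [{"status": 500}] * 3
  print(daily_availability(logs))   # 0.997 -> 99.7%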

Happily, you report these availability numbers to your boss (Dave), and go home. A job well done.

The next day Dave draws your attention to the support forum. Users are complaining that all their searches at shakespeare.com return no results. Dave asks why the availability dashboard shows 99.7% availability for the last day, when there clearly is a problem.

You check the logs and notice that the web server has received just 1000 requests in the last 24 hours, and they're all 200 OKs except for three 500s. Given that you expect at least 100 queries per second, that explains why users are complaining in the forums, although the dashboard looks fine.
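
To put numbers on the gap (this simply restates the figures above): at 100 queries per second you'd expect roughly 8.6 million requests per day, so the 1,000 requests that actually arrived are a tiny fraction of expected traffic, and a success-ratio measurement over them says nothing about the users who never reached the web server at all.

  expected_qps = 100
  seconds_per_day = 24 * 60 * 60

  expected_requests_per_day = expected_qps * seconds_per_day   # 8,640,000
  observed_requests = 1_000

  print(observed_requests / expected_requests_per_day)   # ~0.00012: almost no traffic arrived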

You've made the classic mistake of basing your definition of availability on a measurement that does not match user expectations or business objectives.


Redefining availability in terms of the user experience with black-box monitoring


After fixing the critical issue (a typo in a configuration file) that prevented the Shakespeare frontend service from reaching the backend, we take a step back to think about what it means for our system to be available.

If the "rate of 200 OK logs for shakespeare.com" is not an appropriate availability measurement, then how should we measure availability?

Dave wants to understand the availability as observed by users. When does the user feel that shakespeare.com is available? After some lively back-and-forth, we agree that the system is available when a user can visit shakespeare.com, enter a query and get a result for that query within five seconds, 100% of the time.

So you write a black-box "prober" (black-box, because it makes no assumptions about the implementation of the Shakespeare service; see the SRE Book, Chapter 6) to emulate a full range of client devices (mobile, desktop). For each type of client, you visit shakespeare.com, enter the query "to be or not to be," and check that the result contains the expected link to Hamlet. You run the prober for a week, and finally recalculate the minimum daily availability measure: 80% of queries return Hamlet within five seconds, 18% of queries take longer, 1% time out and another 1% return errors. A full 20% of queries fail our definition of availability!
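
A minimal sketch of such a prober, using the requests library: the domain, the query and the five-second deadline come from the example, but the /search path, the q parameter and the result-checking logic are assumptions about how the hypothetical service works.

  import time
  import requests

  DEADLINE_SECONDS = 5

  def probe(query="to be or not to be", expected_link="hamlet"):
      """One black-box probe: issue a search and check the result the way a user would."""
      start = time.monotonic()
      try:
          resp = requests.get("https://shakespeare.com/search",
                              params={"q": query}, timeout=DEADLINE_SECONDS)
          elapsed = time.monotonic() - start
          ok = (resp.status_code == 200
                and expected_link in resp.text.lower()
                and elapsed <= DEADLINE_SECONDS)
          return "success" if ok else "slow_or_wrong"
      except requests.exceptions.Timeout:
          return "timeout"
      except requests.exceptions.RequestException:
          return "error"

Emulating the different client devices (for instance, by varying the User-Agent header or driving real browsers) is omitted here; running a probe like this on a schedule for each client profile and bucketing the outcomes is what produces a breakdown like the 80/18/1/1 split above.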


Choosing an availability target according to business goals


After getting over his shock, Dave asks a simple question: “Why can't we have 100% returning within 5 seconds?”

You explain all the usual reasons why: power outages, fiber cuts, etc. After an hour or so, Dave is willing to admit that 100% query response in under five seconds is truly impossible.

Which leads Dave to ask, “What availability can we have, then?”

You turn the question around on him: “What availability is required for us to meet our business goals?”

Dave's eyes light up. The business has set a revenue target of $25 million per year, and we make on average $0.01 per query result. At 100 queries per second * 31,536,000 seconds per year * 80% success rate * $0.01 per query, we'll earn $25.23 million. In other words, even with a 20% failure rate, we'll still hit our revenue targets!
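
Dave's back-of-the-envelope math, written out with the figures from the example:

  queries_per_second = 100
  seconds_per_year = 365 * 24 * 60 * 60        # 31,536,000
  success_rate = 0.80
  revenue_per_query = 0.01                      # dollars

  annual_revenue = queries_per_second * seconds_per_year * success_rate * revenue_per_query
  print(f"${annual_revenue:,.0f}")              # $25,228,800, just over the $25M target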

Still, a 20% failure rate is pretty ugly. Even if we think we'll meet our revenue targets, it's not a good user experience and we might have some attrition as a result. Should we fix it, and if so, what should our availability objective be?

Evaluating cost/benefit tradeoffs, opportunity costs


Suppose the rate of queries returning in greater than five seconds can be reduced to 0.5% if an engineer works on the problem for six months. How should we decide whether or not to do this?

We can start by estimating how much the 20% failure rate is going to cost us in missed revenue (accounting for users who give up on retrying) over the life of the product. We know roughly how much it will cost to fix the problem. Naively, we may decide that since the revenue lost due to the error rate exceeds the cost of fixing the issue, then we should fix it.
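
As a rough sketch only, with placeholder figures the example doesn't provide (the abandonment rate and the cost of six engineer-months below are entirely made up; only the failure rates come from the example):

  annual_revenue_at_80_pct = 25_228_800         # from the calculation above
  failure_rate_now = 0.20
  failure_rate_after_fix = 0.025                # ~0.5% slow plus the remaining ~2% timeouts/errors
  abandon_rate = 0.5                            # assumed share of failed queries never retried
  cost_of_fix = 150_000                         # assumed cost of six engineer-months

  potential_revenue = annual_revenue_at_80_pct / (1 - failure_rate_now)
  lost_now = potential_revenue * failure_rate_now * abandon_rate
  lost_after_fix = potential_revenue * failure_rate_after_fix * abandon_rate

  naive_benefit = lost_now - lost_after_fix
  print(naive_benefit > cost_of_fix)            # True here, but see the opportunity cost below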

But this ignores a crucial factor: the opportunity cost of fixing the problem. What other things could an engineer have done with that time instead?

Hypothetically, there’s a new search algorithm that increases the relevance of Shakespeare search results, and putting it into production might drive a 20% increase in search traffic, even as availability remains constant. This increase in traffic could easily offset any lost revenue due to poor availability.

An oft-heard SRE saying is that you should “design a system to be as available as is required, but not much more.” At Google, when designing a system, we generally target a given availability figure (e.g., 99.9%), rather than particular MTBF or MTTR figures. Once we’ve achieved that availability metric, we optimize our operations for "fast fix," i.e., MTTR over MTBF, accepting that failure is inevitable, and that “spes consilium non est” (Hope is not a strategy). SREs are often able to mitigate the user-visible impact of huge problems in minutes, allowing our engineering teams to achieve high development velocity, while simultaneously earning Google a reputation for great availability.

Ultimately, the tradeoff between availability and development velocity belongs to the business. Precisely defining availability in product terms allows us to have a principled discussion and to make choices we can be proud of.

N.B. Google Cloud Next '17 is fewer than seven weeks away. Register now to join Google Cloud SVP Diane Greene, Google CEO Sundar Pichai and other luminaries for three days of keynotes, code labs, certification programs and over 200 technical sessions. And for the first time ever, Next '17 will have a dedicated space for attendees to interact with Google experts in Site Reliability Engineering and Developer Operations.