
SLOs, SLIs, SLAs, oh my – CRE life lessons



Last week on CRE life lessons, we discussed how to come up with a precise numerical target for system availability. We term this target the Service Level Objective (SLO) of our system. Any future discussion about whether the system is running sufficiently reliably, and what design or architectural changes we should make to it, must be framed in terms of the system continuing to meet this SLO.

We also have a direct measurement of SLO conformance: the frequency of successful probes of our system. This is a Service Level Indicator (SLI). When we evaluate whether our system has been running within SLO for the past week, we look at the SLI to get the service availability percentage. If it goes below the specified SLO, we have a problem and may need to make the system more available in some way, such as running a second instance of the service in a different city and load balancing between the two.
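
To make this concrete, here is a minimal sketch (in Python) of how an SLI computed from probe results might be checked against an SLO. The probe data, the five-minute probing interval and the 99.9% target are illustrative assumptions, not part of any particular monitoring system.

def availability_sli(probe_results):
    """Return the fraction of successful probes (the SLI)."""
    successes = sum(1 for ok in probe_results if ok)
    return successes / len(probe_results)

# One week of five-minute probes: True = probe succeeded, False = it failed.
probe_results = [True] * 2014 + [False] * 2   # 2016 probes in a week

SLO_TARGET = 0.999   # assumed 99.9% availability objective

sli = availability_sli(probe_results)
print("SLI: %.3f%%" % (sli * 100))
if sli < SLO_TARGET:
    print("Out of SLO: consider adding redundancy, e.g. a second instance.")
else:
    print("Within SLO for the week.")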

Why have an SLO at all?

Suppose that we decide that running our aforementioned Shakespeare service against a formally defined SLO is too rigid for our tastes; we decide to throw the SLO out of the window and make the service “as available as is reasonable.” This makes things easier, no? You simply don’t mind if the system goes down for an hour now and then. Indeed, perhaps downtime is normal during a new release and the attendant stop-and-restart.

Unfortunately for you, customers don’t know that. All they see is that Shakespeare searches that were previously succeeding have suddenly started to return errors. They raise a high-priority ticket with support, who confirms that they see the error rate and escalate to you. Your on-call engineer investigates, confirms this is a known issue, and responds to the customer with “this happens now and again, you don’t have to escalate.” Without an SLO, your team has no principled way of saying what level of downtime is acceptable; there's no way to measure whether or not this is a significant issue with the service, and you cannot terminate the escalation early with “Shakespeare search service is currently operating within SLO.” As our colleague Perry Lorier likes to say, “if you have no SLOs, toil is your job.”

The SLO you run at becomes the SLO everyone expects


A common pattern is to start your system off at a low SLO, because that’s easy to meet: you don’t want to run a 24/7 rotation, your initial customers are OK with a few hours of downtime, so you target at least 99% availability, which allows up to 1.68 hours of downtime per week. But in fact, your system is fairly resilient and for six months operates at 99.99% availability, down for only a few minutes per month.

But then one week, something breaks in your system and it’s down for a few hours. All hell breaks loose. Customers page your on-call complaining that your system has been returning 500s for hours. These pages go unnoticed, because on-call leaves their pagers on their desks overnight, per your SLO which only specifies support during office hours.

The problem is, customers have become accustomed to your service being always available. They’ve started to build it into their business systems on the assumption that it’s always available. When it’s been continually available for six months, and then goes down for a few hours, something is clearly seriously wrong. Your excessive availability has become a problem because now it’s the expectation. Thus the expression, “An SLO is a target from above and below”: don’t make your system very reliable if you don’t intend to commit to it being that reliable.

Within Google, we implement periodic downtime in some services to prevent a service from being overly available. In the SRE Book, our colleague Marc Alvidrez tells a story about our internal lock system, Chubby. Then, there’s the set of test front-end servers for internal services to use in testing, allowing those services to be accessible externally. These front-end servers are convenient but are explicitly not intended for use by real services; they have a one-business-day support SLA, and so can be down for 48 hours before the support team is even obligated to think about fixing them. Over time, experimental services that used those front-ends started to become critical; when we finally had a few hours of downtime on the front-ends, it caused widespread consternation.

Now we run a quarterly planned-downtime exercise with these front-ends. The front-end owners send out a warning, then block all services on the front-ends except for a small whitelist. They keep this up for several hours, or until a major problem caused by the blockage appears, in which case the blockage can be quickly reversed. At the end of the exercise the front-end owners receive a list of services that use the front-ends inappropriately, and work with the service owners to move them to somewhere more suitable. This downtime exercise keeps the front-end availability suitably low, and detects inappropriate dependencies in time to get them fixed.

Your SLA is not your SLO


At Google, we distinguish between a Service-Level Agreement (SLA) and a Service-Level Objective (SLO). An SLA normally involves a promise to someone using your service that its availability should meet a certain level over a certain period, and if it fails to do so then some kind of penalty will be paid. This might be a partial refund of the service subscription fee paid by customers for that period, or additional subscription time added for free. The concept is that going out of SLA is going to hurt the service team, so they'll push hard to keep it within SLA.

Because of this, and because of the principle that availability shouldn’t be much better than the SLO, the SLA is normally a looser objective than the SLO. This might be expressed in availability numbers: for instance, an availability SLA of 99.9% over 1 month with an internal availability SLO of 99.95%. Alternatively the SLA might only specify a subset of the metrics comprising the SLO.

For example, with our Shakespeare search service, we might decide to provide it as an API to paying customers, where a customer pays us $10K per month for the right to send up to one million searches per day. Now that money is involved, we need to specify in the contract how available they can expect the service to be, and what happens if we breach that agreement. We might say that we'll provide the service at a minimum of 99% availability, following the definition of successful queries given previously. If the service drops below 99% availability in a month, then we'll refund $2K; if it drops below 80%, then we'll refund $5K.

If you have an SLA that's different from your SLO, as it almost always is, it’s important for your monitoring to measure SLA compliance explicitly. You want to be able to view your system’s availability over the SLA calendar period, and easily see if it appears to be in danger of going out of SLA. You'll also need a precise measurement of compliance, usually from logs analysis. Since we have an extra set of obligations (in the form of our SLA) to paying customers, we need to measure queries received from them separately from other queries (we might not mind dropping queries from non-paying users if we have to start load shedding, but we really care about any query from the paying customer that we fail to handle properly). That’s another benefit of establishing an SLA: it’s an unambiguous way to prioritize traffic.
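
As a rough illustration of that kind of logs-based accounting, here is a sketch (in Python) that computes availability over paying-customer queries only and applies the refund tiers from the example contract above. The log record format and field names are hypothetical.

def monthly_sla_report(log_records):
    """Compute SLA availability over paying-customer queries only.

    log_records is an iterable of dicts with hypothetical fields:
    'customer_tier' ('paying' or 'free') and 'success' (bool).
    """
    total = good = 0
    for record in log_records:
        if record['customer_tier'] != 'paying':
            continue   # free traffic doesn't count toward the SLA
        total += 1
        if record['success']:
            good += 1
    availability = good / total if total else 1.0

    # Refund tiers from the example contract: $2K below 99%, $5K below 80%.
    if availability < 0.80:
        refund = 5000
    elif availability < 0.99:
        refund = 2000
    else:
        refund = 0
    return availability, refund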

When you define your SLA, you need to be extra-careful about which queries you count as legitimate. For example, suppose that you give each of three major customers (whose traffic dominates your service) a quota of one million queries per day. One of your customers releases a buggy version of their mobile client, and issues two million queries per day for two days before they revert the change. Over a 30-day period you’ve issued approximately 90 million good responses, and two million errors; that gives you a 97.8% success rate. You probably don’t want to give all your customers a refund as a result of this; two customers had all their queries succeed, and the customer for whom two million out of 32 million queries were rejected brought this upon themselves. So perhaps you should exclude all “out of quota” response codes from your SLA accounting.

On the other hand, suppose you accidentally push an empty quota specification file to your service before going home for the evening. All customers receive a default quota of 1,000 queries per day. Your three top customers get served constant “out of quota” errors for 12 hours until you notice the problem when you come into work in the morning, and revert the change. You’re now showing 1.5 million rejected queries out of 90 million for the month, a 98.3% success rate. This is all your fault: counting this month as a 100% success for the 88.5 million queries you did serve misses the point, and excluding these errors from your SLA measurement would be a moral failure.
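
One way to express that accounting rule is sketched below, with hypothetical response records: out-of-quota errors count against the SLA only when the quota failure was ours (as with the empty quota file), not when the customer simply exceeded their quota.

def counts_against_sla(response):
    """Decide whether a response counts as an SLA failure.

    response is a hypothetical dict with 'status' ('ok', 'out_of_quota' or
    'error') and 'caused_by_provider' (True if, say, we pushed a bad quota
    config; False if the customer simply exceeded their quota).
    """
    if response['status'] == 'ok':
        return False
    if response['status'] == 'out_of_quota':
        # Customer-caused overage is excluded; our own misconfiguration is not.
        return response['caused_by_provider']
    return True   # every other failure counts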

Conclusion


SLIs, SLOs and SLAs aren’t just useful abstractions. Without them you cannot know if your system is reliable, available, or even useful. If they don’t tie explicitly back to your business objectives then you have no idea if the choices you make are helping or hurting your business. You also can’t make honest promises to your customers.

If you’re building a system from scratch, make sure that SLIs, SLOs and SLAs are part of your system requirements. If you already have a production system but don’t have them clearly defined then that’s your highest priority work.

To summarize:
  • If you want to have a reliable service, you must first define “reliability.” In most cases that actually translates to availability.
  • If you want to know how reliable your service is, you must be able to measure the rates of successful and unsuccessful queries; these will form the basis of your SLIs.
  • The more reliable the service, the more it costs to operate. Define the lowest level of reliability that you can get away with, and state that as your Service Level Objective (SLO).
  • Without an SLO, your team and your stakeholders cannot make principled judgements about whether your service needs to be made more reliable (increasing cost and slowing development) or less reliable (allowing greater velocity of development).
  • If you’re charging your customers money you'll probably need an SLA, and it should be a little bit looser than your SLO.

As an SRE (or DevOps professional), it's your responsibility to understand how your systems serve the business in meeting those objectives, and, as much as possible, control for risks that threaten the high-level objective. Any measure of system availability that ignores business objectives is worse than worthless because it obfuscates the actual availability, leading to all sorts of dangerous scenarios, false senses of security and failure.

For those of you who wrote us thoughtful comments and questions from our last article, we hope this post has been helpful. Keep the feedback coming!

N. B. Google Cloud Next '17 is fewer than seven weeks away. Register now to join Google Cloud SVP Diane Greene, Google CEO Sundar Pichai, and other luminaries for three days of keynotes, code labs, certification programs, and over 200 technical sessions. And for the first time ever, Next '17 will have a dedicated space for attendees to interact with Google experts in Site Reliability Engineering and Developer Operations.

Available . . . or not? That is the question – CRE life lessons



In our last installment of the CRE life lessons series, we discussed how to survive a "success disaster" with load-shedding techniques. We got a lot of great feedback from that post, including several questions about how to tie measurements to business objectives. So, in this post, we decided to go back to first principles, and investigate what “success” means in the first place, and how to know if your system is “succeeding” at all.

A prerequisite to success is availability. A system that's unavailable cannot perform its function and will fail by default. But what is "availability"? We must define our terms:

Availability defines whether a system is able to fulfill its intended function at a point in time. In addition to being used as a reporting tool, the historical availability measurement can also describe the probability that your system will perform as expected in the future. Sometimes availability is measured by using a count of requests rather than time directly. In either case, the structure of the formula is the same: successful units / total units. For example, you might measure uptime / (uptime + downtime), or successful requests / (successful requests + failed requests). Regardless of the particular unit used, the result is a percentage like 99.9% or 99.999%, sometimes referred to as “three nines” or “five nines.”

Achieving high availability is best approached by focusing on the unsuccessful component (e.g., downtime or failed requests). Taking a time-based availability metric as an example: given a fixed period of time (e.g., 30 days, 43200 minutes) and an availability target of 99.9% (three nines), simple arithmetic shows that the system must not be down for more than 43.2 minutes over the 30 days. This 43.2 minute figure provides a very concrete target to plan around, and is often referred to as the error budget. If you exceed 43.2 minutes of downtime over 30 days, you'll not meet your availability goal.
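
The arithmetic is simple enough to capture in a couple of lines; here is a sketch using the example's 30-day period and three-nines target (the function name is ours).

def error_budget_minutes(availability_target, period_minutes=30 * 24 * 60):
    """Allowed downtime over a period, given an availability target."""
    return (1 - availability_target) * period_minutes

# 99.9% over 30 days (43,200 minutes) allows 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))   # 43.2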

Two further concepts are often used to help understand and plan the error budget:

Mean Time Between Failures (MTBF): total uptime / # of failures. This is the average time between failures.

Mean Time to Repair (MTTR): total downtime / # of failures. This is the average time taken to recover from a failure.

These metrics can be computed historically (e.g., over the past 3 months, or year) and combined as (Total Period / MTBF) * MTTR to give an expected downtime value. Continuing with the above example, if the historical MTBF is calculated to be 10 days, and the historical MTTR is calculated to be 20 minutes, then you would expect to see 60 minutes of downtime ((30 days / 10 days) * 20 minutes), clearly outside the 43.2-minute error budget for a three-nines availability target. To meet the target would require increasing the MTBF (say, to 20 days) or decreasing the MTTR (say, to 10 minutes), or a combination of both.
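
Here is the same calculation as a sketch, using the figures from the example above (the function and its argument names are ours):

def expected_downtime_minutes(period_days, mtbf_days, mttr_minutes):
    """Expected downtime over a period: (period / MTBF) * MTTR."""
    return (period_days / mtbf_days) * mttr_minutes

print(expected_downtime_minutes(30, 10, 20))   # 60.0: over the 43.2-minute budget
print(expected_downtime_minutes(30, 20, 20))   # 30.0: a 20-day MTBF would fit
print(expected_downtime_minutes(30, 10, 10))   # 30.0: so would a 10-minute MTTR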

Keeping the concepts of error budget, MTBF and MTTR in mind when defining an availability target helps to provide justification for why the target is set where it is. Rather than simply describing the target as a fixed number of nines, it's possible to relate the numeric target to the user experience in terms of total allowable downtime, frequency and duration of failure.

Next, we'll look at how to ensure this focus on user experience is maintained when measuring availability.


Measuring availability


How do you know whether a system is available? Consider a fictitious "Shakespeare" service, which allows users to find mentions of a particular word or phrase in Shakespeare’s texts. This is a canonical example, used frequently within Google for training purposes, and mentioned throughout the SRE book.

Let's try working the scientific method to determine the availability of the hypothetical Shakespeare system.
  1. Question: how often is the system available?
  2. Observation: when you visit shakespeare.com, you normally get back the "200 OK" status code and an HTML blob. Very rarely, you see a 500 Internal Server error or a connection failure.
  3. Hypothesis: if "availability" is the percentage of requests per day that return 200 OK, the system will be 99.9% available.
  4. Measure: "tail" the response logs of the Shakespeare service’s web servers and dump them into a logs-processing system.
  5. Analyze: take a daily availability measurement as the percentage of 200 OK responses vs. the total number of requests.
  6. Interpret: After seven days, there’s a minimum of 99.7% availability on any given day.

Happily, you report these availability numbers to your boss (Dave), and go home. A job well done.

The next day Dave draws your attention to the support forum. Users are complaining that all their searches at shakespeare.com return no results. Dave asks why the availability dashboard shows 99.7% availability for the last day, when there clearly is a problem.

You check the logs and notice that the web server has received just 1000 requests in the last 24 hours, and they're all 200 OKs except for three 500s. Given that you expect at least 100 queries per second, that explains why users are complaining in the forums, although the dashboard looks fine.

You've made the classic mistake of basing your definition of availability on a measurement that does not match user expectations or business objectives.


Redefining availability in terms of the user experience with black-box monitoring


After fixing the critical issue (a typo in a configuration file) that prevented the Shakespeare frontend service from reaching the backend, we take a step back to think about what it means for our system to be available.

If the "rate of 200 OK logs for shakespeare.com" is not an appropriate availability measurement, then how should we measure availability?

Dave wants to understand the availability as observed by users. When does the user feel that shakespeare.com is available? After some lively back-and-forth, we agree that the system is available when a user can visit shakespeare.com, enter a query and get a result for that query within five seconds, 100% of the time.

So you write a black-box "prober" (black-box, because it makes no assumptions about the implementation of the Shakespeare service; see the SRE Book, Chapter 6) to emulate a full range of client devices (mobile, desktop). For each type of client, you visit shakespeare.com, enter the query "to be or not to be," and check that the result contains the expected link to Hamlet. You run the prober for a week, and finally recalculate the minimum daily availability measure: 80% of queries return Hamlet within five seconds, 18% of queries take longer, 1% time out and another 1% return errors. A full 20% of queries fail our definition of availability!
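
A heavily simplified sketch of such a prober is shown below, using Python's standard urllib against a hypothetical query endpoint and response format. A real prober would emulate each client type, render the page the way that client would, and feed its results into monitoring rather than returning them.

import time
import urllib.parse
import urllib.request

def probe_shakespeare(query="to be or not to be", deadline_s=5.0):
    """Black-box probe: issue a search and check the result within a deadline.

    The URL and the 'Hamlet' substring check are assumptions about the
    Shakespeare service's interface, made for illustration only.
    """
    url = ("https://shakespeare.example.com/search?" +
           urllib.parse.urlencode({"q": query}))
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=deadline_s) as response:
            body = response.read().decode("utf-8", errors="replace")
    except Exception:
        return False   # connection failure, timeout or HTTP error
    elapsed = time.monotonic() - start
    # Available only if the expected result came back within the deadline.
    return elapsed <= deadline_s and "Hamlet" in body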


Choosing an availability target according to business goals


After getting over his shock, Dave asks a simple question: “Why can't we have 100% returning within 5 seconds?”

You explain all the usual reasons why: power outages, fiber cuts, etc. After an hour or so, Dave is willing to admit that 100% query response in under five seconds is truly impossible.

Which leads Dave to ask, “What availability can we have, then?”

You turn the question around on him: “What availability is required for us to meet our business goals?”

Dave's eyes light up. The business has set a revenue target of $25 million per year, and we make on average $0.01 per query result. At 100 queries per second * 31,536,000 seconds per year * 80% success rate * $0.01 per query, we'll earn $25.23 million. In other words, even with a 20% failure rate, we'll still hit our revenue targets!
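
For the record, the back-of-the-envelope arithmetic looks like this (all figures are the example's, not real numbers):

# Annual revenue at the observed success rate.
queries_per_second = 100
seconds_per_year = 31536000      # 365 days
success_rate = 0.80
revenue_per_query = 0.01         # dollars

annual_revenue = (queries_per_second * seconds_per_year
                  * success_rate * revenue_per_query)
print("$%.2fM" % (annual_revenue / 1e6))   # $25.23M, above the $25M target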

Still, a 20% failure rate is pretty ugly. Even if we think we'll meet our revenue targets, it's not a good user experience and we might have some attrition as a result. Should we fix it, and if so, what should our availability objective be?

Evaluating cost/benefit tradeoffs, opportunity costs


Suppose the rate of queries returning in greater than five seconds can be reduced to 0.5% if an engineer works on the problem for six months. How should we decide whether or not to do this?

We can start by estimating how much the 20% failure rate is going to cost us in missed revenue (accounting for users who give up on retrying) over the life of the product. We know roughly how much it will cost to fix the problem. Naively, we may decide that since the revenue lost due to the error rate exceeds the cost of fixing the issue, then we should fix it.

But this ignores a crucial factor… the opportunity cost of fixing the problem. What other things could an engineer have done with that time instead?

Hypothetically, there’s a new search algorithm that increases the relevance of Shakespeare search results, and putting it into production might drive a 20% increase in search traffic, even as availability remains constant. This increase in traffic could easily offset any lost revenue due to poor availability.

An oft-heard SRE saying is that you should “design a system to be as available as is required, but not much more.” At Google, when designing a system, we generally target a given availability figure (e.g., 99.9%), rather than particular MTBF or MTTR figures. Once we’ve achieved that availability metric, we optimize our operations for "fast fix," e.g., MTTR over MTBF, accepting that failure is inevitable, and that “spes consilium non est” (Hope is not a strategy). SREs are often able to mitigate the user visible impact of huge problems in minutes, allowing our engineering teams to achieve high development velocity, while simultaneously earning Google a reputation for great availability.

Ultimately, the tradeoff between availability and development velocity belongs to the business. Precisely defining availability in product terms allows us to have a principled discussion and to make choices we can be proud of.


Using load shedding to survive a success disaster – CRE life lessons



Editor’s note: Just because something is a good problem to have, doesn’t mean it’s not a problem. In this latest installment of the CRE life lessons series, we learn about techniques that the Google Site Reliability Engineering team uses to handle too much of a good thing (traffic) with grace, and how you can apply them to your own code running on Google Cloud Platform (GCP).

In our last installment in this series, we talked about how to prevent an accidental DDoS from your own code. In this post, we’ll talk about what to do when you have the problem everybody hopes for: the success disaster.

The most painful kind of software failure is the "success disaster." This happens when your application consistently gets more traffic than you can sustainably handle. While you scramble to add capacity, your users may start to get the idea that it’s not worth the effort to use your system and eventually leave for something else.

What makes this the worst sort of failure is that nobody thinks it will happen to them while simultaneously hoping it does. It’s an entirely avoidable problem. Embrace the practice of load shedding and spare yourself the pain of this regret. Load shedding is a technique that allows your system to keep serving at its nominal capacity, regardless of how much traffic is being sent to it, in order to maintain availability. To do this, you'll need to throw away some requests and make clients retry.

Procrustean load shedding

You may recall, Poseidon’s son Procrustes had a very, um, one-size-fits-all approach to accommodating his overnight guests. In its simplest form, load shedding can be a bit like that too: observe some easily obtained local measure like CPU load, memory utilization or request queue length, and when this load number crosses a predetermined “safe” level as established by load testing, drop a fraction of incoming traffic to bring the load back to safe levels. For example, the system may drop the first n of each 10 requests where n starts at 1, ramps up as system load stays high, and drops gradually as the load returns to safe levels.

For example, here’s a Python method that processes a new request while keeping the load under a hard limit of 45 units and pushing down towards a soft limit of 25 units:

def addRequest(self, r):
    """Accept or shed an incoming request, based on current load.

    Sheds a progressively larger fraction of requests as the load rises
    above the soft quota, and sheds everything once it hits the hard quota.
    """
    HARD_QUOTA = 45
    SOFT_QUOTA = 25
    STEPS = 10

    divisor = (HARD_QUOTA - SOFT_QUOTA) / STEPS

    self.received += 1
    self.req_modulus = (self.req_modulus + 1) % STEPS

    # How loaded are we right now?
    load = self.getLoad()

    # Become progressively more likely to reject requests once load exceeds
    # the soft quota; reject everything once load hits the hard limit.
    threshold = int((HARD_QUOTA - load) / divisor)

    if self.req_modulus < threshold:
        # We're not too loaded: accept the request.
        self.active_requests.append(r)
        self.accepted += 1
    else:
        self.rejected += 1


When you feed a varying load into this system, you get the behavior described below.
In the modeled system, requests expire after a fixed time and are of varying cost. At a normal request rate (0-2 requests/sec) the system is running comfortably within limits. When requests double at t=30, the load shedding kicks in; we start to see a rise in rejected and expired requests but the load is kept under the hard limit. Rejected requests are more common than expired ones, which is what we want, as expired requests consume system resources for no utility. Once the request rate returns to normal at t=90, new rejected and expired requests stop. Between t=120 and t=150 there's a 50% rise in requests, which reactivates load shedding but at a lower rate.

This kind of load shedding is simple to implement and is definitely better than having no load shedding at all, but it also has at least one very serious drawback: it assumes that all types of requests and clients are equal. In our experience, this is seldom true. If 95% of your online store’s requests are people paging through your catalog, and 5% are actual purchase requests, wouldn’t you want to prioritize the latter? A Procrustean approach to load shedding won’t help you with this.

Fortunately there are alternatives.

Ranking requests for criticality and cost

Before you can safely throw away less valuable work (i.e., drop requests, refuse connections, etc.) you first have to rank the relative importance of each request. That means figuring out what a request costs.

Every request has two costs:
  1. The cost to perform the work (the direct cost)
  2. The cost to not perform the work (the opportunity cost)
Direct cost is usually expressed in terms of a finite computing resource like CPU, RAM or network bandwidth. In our experience, however, this most usually resolves to CPU, as RAM is often already over-provisioned relative to CPU. (Networks can sometimes be the scarce resource, but normally only for specialty cases.)

Opportunity cost, on the other hand, is a little trickier to calculate. How do you measure the cost of not doing something? It’s tempting to try to express it in terms of dollars but that’s usually an oversimplification. Dollars of revenue are not the same as dollars of profit, and one might be far more important to your business than the other. With that in mind, here are two rules to remember when thinking about this:
  1. Denominate your costs in terms of your scarcest resource. If CPU is the scarcest thing in your system then use that to express all of your costs. If it’s revenue or profit then use that. At Google, for example, we sometimes use engineering hours as a measure of cost because we perceive engineering time as more scarce than dollars.
  2. Get everyone to agree on the units before you start ranking request types. Different parts of your business will have different views of the costs of dropping traffic. The ads team might value the dollars in lost revenue for not serving a piece of content while your marketing team might value the total number of users that can simultaneously access your application. The UX team, on the other hand, might think that latency is the most important thing since laggy UIs make users grumpy. The point is that this all gets settled by deciding on the denominating units first!
Once you decide how to measure the costs of dropped work then you can stack-rank the requests to shed. This is known as establishing your criticality. The more critical traffic gets prioritized ahead of the less critical traffic.

Of course, even this has its nuances. For example, some load shedding systems are designed to minimize the aggregate opportunity cost in the system while others consider how the opportunity costs and direct costs relate to each other. (Known as weighted or scaled costs.)

It’s almost never possible to know either the direct or opportunity cost of a specific query at runtime. Even if you could know, it’s likely that the computational overhead of calculating it in-line for every request would seriously reduce your serving capacity. Instead, you should establish a few criticality buckets or classes for your known request types. This way you can more easily classify each request into one of the buckets and use that to stack-rank their priorities. (Those of you with networking backgrounds will recognize this as a key component of Quality of Service (QoS) systems.)
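
Here is a minimal sketch of that bucketing idea, using the online-store example from earlier; the bucket names and the request fields it inspects are made-up illustrations, not a standard taxonomy.

# Criticality buckets, most critical first.
CRITICALITY = ["critical_user_facing", "standard", "non_critical_retryable"]

def classify(request):
    """Cheaply map a request (a dict, for illustration) to a criticality bucket."""
    if request.get("path", "").startswith("/checkout"):
        return "critical_user_facing"       # purchases outrank catalog browsing
    if request.get("is_background_batch"):
        return "non_critical_retryable"     # e.g., background photo uploads
    return "standard"

def shed_order(requests):
    """Return requests in the order they should be shed: least critical first."""
    return sorted(requests,
                  key=lambda r: CRITICALITY.index(classify(r)),
                  reverse=True)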

Setting criticality


For load shedding to work best, your system should determine the criticality bucket of a request as early as possible, and as cheaply as possible, based on your criteria. Some common examples of how to determine criticality include:
  • An explicit field in the request specifying the bucket.
  • Bucketing by the hostname, which lets you "black-hole" low-priority traffic in overload situations by using DNS to point to a sacrificial server. This is a big hammer, but occasionally life-saving because it can stop requests from reaching your overloaded service in the first place.
  • The URL path, which is fairly cheap to check though does require some extra processing by your front-end service.
  • User ID, and whether it belongs to a specific group, e.g.,"paying customers" (highest), "logged in users" (medium-high), "logged-out users" (medium-low), "known robot accounts" (lowest). This allows the most precise bucketing, but is more expensive to check for each request.
At Google, we often classify batch operations (for example, background photo uploads) as "non-critical retryable." This signals that a request is not directly user-facing and that the user generally doesn't mind if the handling is delayed several minutes, or even an hour. In this case, the system can easily drop the request and tell the client to re-attempt the upload later. As long as the retry interval is quite large, the overall volume of retries remains low, while still allowing clients to resume uploading once the system load crisis is over.

We’ve had several painful experiences where a rogue client was using the same hostname as many good clients, making it impossible to block the rogue without affecting the good clients as well. Now, in situations where a public HTTP-based infrastructure service serves many different kinds of clients, every type of client accesses the service through its own hostname. This allows us to isolate all traffic from a badly-behaved client and route it to more distant servers with spare capacity. While this may increase latency for these bad clients, it spares other client types. Alternatively we can designate a subset of servers to handle the bad client traffic as best as they can, accepting that they'll likely become overloaded, and keep traffic from other clients away from those hosts. There’s also the last-ditch option of simply black-holing the bad traffic.


Criticality changes over time

Opportunity costs seldom follow a straight line, and what’s critical now might not be later. Over time, a request might move from one criticality bucket to another. Take, for example, loading your front page.

At first, the request to load your front page is very valuable because it’s serving important content (perhaps ads) to your user. After a certain amount of waiting, say 2 seconds, the user will probably abandon the slow page and go someplace else. That means from 0.0 second until 1.9 seconds the request to load your front page might be in your highest criticality bucket. Once it hits 2.0 seconds, however, you might as well drop it to the lowest bucket (or cancel it altogether), because the user probably isn’t there anymore.

For this reason, a great source of load that you can shed cheaply is requests that are exceeding their response deadlines, as established by user interface data and design. The better tuned your deadlines, the cheaper this will be.
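
A sketch of that deadline check is below; the two-second deadline comes from the front-page example, and the request fields are assumptions.

import time

def past_deadline(request, deadline_s=2.0):
    """True if the user has probably given up waiting on this request.

    request['start_time'] is assumed to be a time.monotonic() timestamp
    recorded when the request arrived.
    """
    return time.monotonic() - request["start_time"] > deadline_s

def shed_expired(pending_requests):
    """Drop queued requests whose users have likely already abandoned them."""
    return [r for r in pending_requests if not past_deadline(r)]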

Soft quotas vs. hard quotas


Suppose your system has a total serving capacity of 1,000,000 QPS and you average 10,000 simultaneous users at peak. In order to protect yourself from particularly demanding users you decide to cap each client at 100 QPS. This cap is called a quota.

The problem, of course, with giving each client a hard quota of 100 QPS is that when you have fewer than 10,000 clients hitting your backends, you have idle capacity. Wasted capacity can never be recovered (at least without the aid of a time machine) so you should avoid that at all costs. An important principle we follow inside Google is work conservation, which can be stated as: clients who have exceeded their quotas should not be throttled if the system has remaining capacity.

In our example the 100 QPS quota is a soft quota because it shouldn’t necessarily be enforced if the system can tolerate the extra load. A hard quota, on the other hand, is a limitation that cannot ever be exceeded under any circumstances. Hard quotas exist to protect your infrastructure while soft quotas exist to help you manage finite resources “equitably,” however that’s defined in your business.

This brings us to another important consideration: fairness. When the system runs out of capacity, the clients who are most over their quotas should be the first to be throttled. If user X is sending 150 QPS to the system and user Y is sending 500 QPS, it might be unfair to squash user X before user Y has had 350 QPS load-shed (bringing user Y down to the same 150 QPS as user X).
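
One way to sketch a work-conserving soft quota with that notion of fairness: only throttle when the system as a whole is out of capacity, and shed from the clients that are furthest over quota first. The class and its fields below are simplified assumptions, not a production throttler.

class SoftQuotaThrottle(object):
    """Work-conserving soft quota: over-quota clients are only throttled when
    the system is out of capacity, and the most-over-quota clients go first."""

    def __init__(self, soft_quota_qps, system_capacity_qps):
        self.soft_quota = soft_quota_qps
        self.capacity = system_capacity_qps

    def clients_to_throttle(self, client_qps):
        """client_qps maps client -> current QPS; returns clients to throttle,
        ordered by how far over their soft quota they are."""
        if sum(client_qps.values()) <= self.capacity:
            return []   # work conservation: spare capacity, so throttle nobody
        overage = {client: qps - self.soft_quota
                   for client, qps in client_qps.items()
                   if qps > self.soft_quota}
        return sorted(overage, key=overage.get, reverse=True)

# With X at 150 QPS and Y at 500 QPS against a 100 QPS soft quota, Y is
# throttled before X once the system runs out of capacity.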

Optimistic and pessimistic throttling


Having decided which traffic to throttle, you still need to decide how to throttle. Your two basic choices are optimistic and pessimistic throttling.

Optimistic throttling just means that you don’t start dropping traffic until you reach global capacity. When this happens, the load shedding system starts working its way through requests, beginning with the least important items and working back up the stack until things are healthy again. The advantage to this approach is that it’s pretty easy to implement and relatively computationally “cheap” because you’re not reacting until you get close to your global limit.

The downside of optimistic throttling, however, is that you'll spike over your global maximum while you start shedding load. Most users will only experience this momentary overload in the form of slightly higher latency, and for this reason, this is our recommended approach for a majority of systems.

If you do choose to go down the optimistic throttling path, it’s super important to thoroughly test your logic. With this approach, there’s a risk that your active load shedding may break due to a coding error in one of your binary releases, and you may not notice it for several weeks until you hit a peak that triggers load shedding and your servers start to segfault. Not that this has ever happened to us . . . ;-)

Pessimistic throttling, on the other hand, assumes that you may not exceed your global maximum under any circumstance, not even for a very short period of time. This is a more computationally expensive approach because the load management system has to continually compute (and recompute) quotas and other limits and transmit them throughout your system. This almost always means that you never quite serve up to your global capacity. And even when you do, the additional computational load eats into capacity that would otherwise be serving capacity. For these reasons, pessimistic throttling is more difficult and costlier to implement and maintain.

Throttling as a signal


If you're the owner of a system that has started to throttle some of its traffic, what does that tell you?

The naive interpretation is that you have a problem that you need to fix, and the simplest approach is to add more capacity in the system. For example, you could turn up more servers, or add resources to the ones you already have. However, if you’re spending 20% more to keep 20% more servers up and running, but the extra capacity is only used for a few minutes at peak every day, this isn’t a good use of resources.

Instead, look at the effects of throttling in terms of user experience and revenue. Are real users seeing errors or service degradation as a result? If so, what fraction of active users are affected? How many revenue-related requests are being throttled? How much is this costing you, compared to the cost of providing extra compute resource to serve those requests?

In many cases, if the system is only throttling non-interactive retryable requests, then your system is probably working as intended. As long as the throttling period is not prolonged and the retries are completing within your processing SLO there’s no real reason to spend more money to serve them more promptly. That said, if your service is throttling traffic for 12 hours every day, it may be time to do something about its capacity.

Analyzing the impact of throttling should be relatively easy to perform because you’re already classifying your requests into buckets, while monitoring tools can show you what fraction of each bucket’s requests was throttled.

Case study


Google once ran a service with many millions of mobile clients, each of which cached state on the user’s device about images that were being incrementally uploaded (in the background) to a backend storage service. The service was designed to handle peak global traffic, plus an additional margin, with the assumption that two serving locations could be unavailable at any one time. The service also handled a significant amount of interactive (user-facing) traffic.

We identified this service as a candidate for load shedding, and implemented it by marking requests with a new "request priority" field, with values ranging from "critical user-facing" to "non-critical background" (background uploads). The service was set to automatically shed requests once it reached its predetermined maximum capacity, starting with the lowest priority and working its way up.

Two days after the load shedding code made it to production, a new app release was pushed to the clients that had the unfortunate side effect of resetting the record of which data had already been uploaded. This bug made all the clients try to connect to our service at once to re-upload all their data. You can see the upload failure rate in the following graph:
This is not a graph you want to see if you’re the SRE on call. But the system continued to serve traffic correctly. Load shedding saved it from becoming overloaded by dropping nearly half of all background upload requests, while the remaining clients patiently backed off and retried again later. After a couple of hours, we turned up enough additional capacity to handle the load, the clients uploaded their data, and things went back to normal. (The short spike in server errors is an artefact of the way we disabled the throttling once the new capacity was in place.) In short, load shedding provided defense-in-depth against an irreversible coding bug.

Wrapping up


We all want to build systems whose popularity exceeds our wildest dreams. In thinking about those cases, however, we too often dismiss them by saying “that’s a problem I’d like to have!” In our experience, these are only problems you want to have until you have them. Then they’re just problems, and painful ones at that.

Reliability is your most important feature and you want your application to be insanely popular. Load shedding is a cheap way to design with that success in mind. Build it in early and you’ll spare yourself the agony of pondering what might have been.

This is our last CRE post of 2016. We hope all of you have a wonderful holiday season and thank you for the wonderful comments and suggestions. We’ll see you again in the new year. Until then: May your queries flow and your pagers stay silent . . .

How to avoid a self-inflicted DDoS Attack – CRE life lessons



Editor’s note: Left unchecked, poor software architecture decisions are the most common cause of application downtime. Over the years, Google Site Reliability Engineering has learned to spot code that could lead to outages, and strives to identify it before it goes into production as part of its production readiness review. With the introduction of Customer Reliability Engineering, we’re taking the same best practices we’ve developed for internal systems, and extending them to customers building applications on Google Cloud Platform. This is the first post in a series written by CREs to highlight real-world problems, and the steps we take to avoid them.


Distributed Denial of Service (DDoS) attacks aren’t anything new on the internet, but thanks to a recent high profile event, they’ve been making fresh headlines. We think it’s a convenient moment to remind our readers that the biggest threat to your application isn’t from some shadowy third party, but from your own code!

What follows is a discussion of one of the most common software architecture design fails, the self-inflicted DDoS, and three methods you can use to avoid it in your own application.

Even distributions that aren’t

There’s a famous saying (variously attributed to Mark Twain, Will Rogers, and others) that goes:
“It ain’t what we don’t know that hurts us so much as the things we know that just ain’t so.”
Software developers make all sorts of simplifying assumptions about user interactions, especially about system load. One of the more pernicious (and sometimes fatal) simplifications is “I have lots of users all over the world. For simplicity, I’m going to assume their load will be evenly distributed.”

To be sure, this often turns out to be close enough to true to be useful. The problem is that it’s a steady state or static assumption. It presupposes that things don’t vary much over time. That’s where things start to go off the rails.

Consider this very common pattern: Suppose you’ve written a mobile app that periodically fetches information from your backend. Because the information isn’t super time sensitive, you write the client to sync every 15 minutes. Of course, you don’t want a momentary hiccup in network coverage to force you to wait an extra 15 minutes for the information, so you also write your app to retry every 60 seconds in the event of an error.

Because you're an experienced and careful software developer, your system consistently maintains 99.9% availability. For most systems that’s perfectly acceptable performance but it also means in any given 30-day month your system can be unavailable for up to 43.2 minutes.

So. Let’s talk about what happens when that’s the case. What happens if your system is unavailable for just one minute?

When your backends come back online you get (a) the traffic you would normally expect for the current minute, plus (b) any traffic from the one-minute retry interval. In other words, you now have 2X your expected traffic. Worse still, your load is no longer evenly distributed because 2/15ths of your users are now locked together into the same sync schedule. Thus, in this state, for any given 15-minute period you'll experience normal load for 13 minutes, no load for one minute and 2X load for one minute.

Of course, service disruptions usually last longer than just one minute. If you experience a 15-minute error (still well within your 99.9% availability) then all of your load will be locked together until after your backends recover. You'll need to provision at least 15X of your normal capacity to keep from falling over. Retries will also often “stack” at your load balancers and your backends will respond more slowly to each request as their load increases. As a result, you might easily see 20X your normal traffic (or more) while your backends heal. In the worst case, the increased load might cause your servers to run out of memory or other resources and crash again.

Congratulations, you’ve been DDoS’d by your own app!

The great thing about known problems is that they usually have known solutions. Here are three things you can do to avoid this trap.

#1 Try exponential backoff

When you use a fixed retry interval (in this case, one minute) you pretty well guarantee that you'll stack retry requests at your load balancer and cause your backends to become overloaded once they come back up. One of the best ways around this is to use exponential backoff.

In its most common form, exponential backoff simply means that you double the retry interval up to a certain limit to lower the number of overall requests queued up for your backends. In our example, after the first one-minute retry fails, wait two minutes. If that fails, wait four minutes and keep doubling that interval until you get to whatever you’ve decided is a reasonable cap (since the normal sync interval is 15 minutes you might decide to cap the retry backoff at 16 minutes).
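
A client-side sketch of that schedule is below; the one-minute initial retry and 16-minute cap come from the example, and everything else is illustrative.

import itertools

def backoff_intervals(initial_s=60, cap_s=16 * 60):
    """Yield retry intervals in seconds, doubling up to a cap."""
    interval = initial_s
    while True:
        yield interval
        interval = min(interval * 2, cap_s)

# The first six retries: 1, 2, 4, 8, 16 and then 16 minutes again.
print(list(itertools.islice(backoff_intervals(), 6)))   # [60, 120, 240, 480, 960, 960]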

Of course, backing off of retries will help your overall load at recovery but won’t do much to keep your clients from retrying in sync. To solve that problem, you need jitter.


#2 Add a little jitter


Jitter is the random interval you add (or subtract) to the next retry interval to prevent clients from locking together during a prolonged outage. The usual pattern is to pick a random number between +/- a fixed percentage, say 30%, and add it to the next retry interval.

In our example, if the next backoff interval is supposed to be 4 minutes, then wait between +/- 30% of that interval. Thirty percent of 4 minutes is 1.2 minutes, so select a random value between 2.8 minutes and 5.2 minutes to wait.
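
Extending the backoff sketch above with jitter (the 30% figure is from the example):

import random

def with_jitter(interval_s, jitter_fraction=0.30):
    """Spread an interval by +/- jitter_fraction so clients drift apart."""
    delta = interval_s * jitter_fraction
    return interval_s + random.uniform(-delta, delta)

# A 4-minute (240s) backoff becomes a random wait between 168s and 312s
# (2.8 to 5.2 minutes), so retries no longer arrive in lockstep.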

Here at Google we’ve observed the impact of a lack of jitter in our own services. We once built a system where clients started off polling at random times but we later observed that they had a strong tendency to become more synchronized during short service outages or degraded operation.

Eventually we saw very uneven load across a poll interval, with most clients polling the service at the same time, resulting in peak load that was easily 3X the average. Here’s a graph from the postmortem from an outage in the aforementioned system. In this case the clients were polling at a fixed 5-minute interval, but over many months became synchronized:

Observe how the traffic (red) comes in periodic spikes, correlating with 2x the average backend latency (green) as the servers become overloaded. That was a sure sign that we needed to employ jitter. This monitoring view also significantly under-counts the traffic peaks because of its sample interval. Once we added a random factor of +/- 1 minute (20%) to each retry, the latency and traffic flattened out almost immediately, the periodicity disappeared, and the backends were no longer overloaded. Of course, we couldn’t do this immediately: we had to build and push a new code release to our clients with this new behavior, so we had to live with this overload for a while.

At this point, we should also point out that in the real world, usage is almost never evenly distributed, even when the users are. Nearly all systems of any scale experience peaks and troughs corresponding with the work and sleep habits of their users. Lots of people simply turn off their phones or computers when they go to sleep. That means that you'll see a spike in traffic as those devices come back online when people wake up.

For this reason it’s also a really good idea to add a little jitter (perhaps 10%) to regular sync intervals, in addition to your retries. This is especially important for first syncs after an application starts. This will help to smooth out daily cyclical traffic spikes and keep systems from becoming overloaded.

#3 Implement retry marking

A large fleet of backends doesn’t recover from an outage all at once. That means that as a system begins to come back online, its overall capacity ramps up slowly. You don’t want to jeopardize that recovery by trying to serve all of your waiting clients at once. Even if you implement both exponential backoff and jitter you still need to prioritize your requests as you heal.

An easy and effective technique to do this is to have your clients mark each attempt with a retry number. A value of zero means that the request is a regular sync. A value of one indicates the first retry and so on. With this in place, the backends can prioritize which requests to service and which to ignore as things get back to normal. For example, you might decide that higher retry numbers indicate users who are further out-of-sync and service them first. Another approach is to cap the overall retry load to a fixed percentage, say 10%, and service all the regular syncs and only 10% of the retries.
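
A server-side sketch of the capped-retry-load variant described above; the 10% cap, the 'retry_number' field and the counters are assumptions for illustration.

class RetryAwareAdmission(object):
    """Admit all first attempts; cap retries at a fraction of total traffic."""

    def __init__(self, max_retry_fraction=0.10):
        self.max_retry_fraction = max_retry_fraction
        self.total = 0
        self.retries_admitted = 0

    def admit(self, request):
        """request carries a 'retry_number' field: 0 means a regular sync."""
        self.total += 1
        if request.get("retry_number", 0) == 0:
            return True   # regular syncs are always served
        # Serve retries only while they stay under the configured fraction.
        if self.retries_admitted + 1 <= self.total * self.max_retry_fraction:
            self.retries_admitted += 1
            return True
        return False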

How you choose to handle retries is entirely up to your business needs. The important thing is that by marking them you have the ability to make intelligent decisions as a service recovers.

You can also monitor the health of your recovery by watching the retry number metrics. If you’re recovering from a six-minute outage, you might see that the oldest retries have a retry sequence number of 3. As you recover, you would expect to see the number of 3s drop sharply, followed by the 2s, and so on. If you don’t see that (or see the retry sequence numbers increase), you know you still have a problem. This would not be obvious by simply watching the overall number of retries.

Parting thoughts

Managing system load and gracefully recovering from errors is a deep topic. Stay tuned for upcoming posts about important subjects like cascading failures and load shedding. In the meantime, if you adopt the techniques in this article you can help keep your one minute network blip from turning into a much longer DDoS disaster.