Site Reliability Engineering (SRE) and DevOps are two trending disciplines with quite a bit of overlap. In the past, some have called SRE a competing set of practices to DevOps. But we think they're not so different after all.
What exactly is SRE and how does it relate to DevOps? Earlier this year, we (Liz Fong-Jones and Seth Vargo) launched a video series to help answer some of these questions and reduce the friction between the communities. This blog post summarizes the themes and lessons of each video in the series to offer actionable steps toward better, more reliable systems.
1. The difference between DevOps and SREIt’s useful to start by understanding the differences and similarities between SRE and DevOps to lay the groundwork for future conversation.
The DevOps movement began because developers would write code with little understanding of how it would run in production. They would throw this code over the proverbial wall to the operations team, which would be responsible for keeping the applications up and running. This often resulted in tension between the two groups, as each group's priorities were misaligned with the needs of the business. DevOps emerged as a culture and a set of practices that aims to reduce the gaps between software development and software operation. However, the DevOps movement does not explicitly define how to succeed in these areas. In this way, DevOps is like an abstract class or interface in programming. It defines the overall behavior of the system, but the implementation details are left up to the author.
SRE, which evolved at Google to meet internal needs in the early 2000s independently of the DevOps movement, happens to embody the philosophies of DevOps, but has a much more prescriptive way of measuring and achieving reliability through engineering and operations work. In other words, SRE prescribes how to succeed in the various DevOps areas. For example, the table below illustrates the five DevOps pillars and the corresponding SRE practices:
|Reduce organization silos||Share ownership with developers by using the same tools and techniques across the stack|
|Accept failure as normal||Have a formula for balancing accidents and failures against new releases|
|Implement gradual change||Encourage moving quickly by reducing costs of failure|
|Leverage tooling & automation||Encourages "automating this year's job away" and minimizing manual systems work to focus on efforts that bring long-term value to the system|
|Measure everything||Believes that operations is a software problem, and defines prescriptive ways for measuring availability, uptime, outages, toil, etc.|
If you think of DevOps like an interface in a programming language, class SRE implements DevOps. While the SRE program did not explicitly set out to satisfy the DevOps interface, both disciplines independently arrived at a similar set of conclusions. But just like in programming, classes often include more behavior than just what their interface defines, or they might implement multiple interfaces. SRE includes additional practices and recommendations that are not necessarily part of the DevOps interface.
DevOps and SRE are not two competing methods for software development and operations, but rather close friends designed to break down organizational barriers to deliver better software faster. If you prefer books, check out How SRE relates to DevOps (Betsy Beyer, Niall Richard Murphy, Liz Fong-Jones) for a more thorough explanation.
2. SLIs, SLOs, and SLAsThe SRE discipline collaboratively decides on a system's availability targets and measures availability with input from engineers, product owners and customers.
It can be challenging to have a productive conversation about software development without a consistent and agreed-upon way to describe a system's uptime and availability. Operations teams are constantly putting out fires, some of which end up being bugs in developer's code. But without a clear measurement of uptime and a clear prioritization on availability, product teams may not agree that reliability is a problem. This very challenge affected Google in the early 2000s, and it was one of the motivating factors for developing the SRE discipline.
SRE ensures that everyone agrees on how to measure availability, and what to do when availability falls out of specification. This process includes individual contributors at every level, all the way up to VPs and executives, and it creates a shared responsibility for availability across the organization. SREs work with stakeholders to decide on Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
- SLIs are metrics over time such as request latency, throughput of requests per second, or failures per request. These are usually aggregated over time and then converted to a rate, average or percentile subject to a threshold.
- SLOs are targets for the cumulative success of SLIs over a window of time (like "last 30 days" or "this quarter"), agreed-upon by stakeholders
The video also discusses Service Level Agreements (SLAs). Although not specifically part of the day-to-day concerns of SREs, an SLA is a promise by a service provider, to a service consumer, about the availability of a service and the ramifications of failing to deliver the agreed-upon level of service. SLAs are usually defined and negotiated by account executives for customers and offer a lower availability than the SLO. After all, you want to break your own internal SLO before you break a customer-facing SLA.
SLIs, SLOs and SLAs tie back closely to the DevOps pillar of "measure everything" and one of the reasons we say class SRE implements DevOps.
3. Risk and error budgetsWe focus here on measuring risk through error budgets, which are quantitative ways in which SREs collaborate with product owners to balance availability and feature development. This video also discusses why 100% is not a viable availability target.
Maximizing a system's stability is both counterproductive and pointless. Unrealistic reliability targets limit how quickly new features can be delivered to users, and users typically won't notice extreme availability (like 99.999999%) because the quality of their experience is dominated by less reliable components like ISPs, cellular networks or WiFi. Having a 100% availability requirement severely limits a team or developer’s ability to deliver updates and improvements to a system. Service owners who want to deliver many new features should opt for less stringent SLOs, thereby giving them the freedom to continue shipping in the event of a bug. Service owners focused on reliability can choose a higher SLO, but accept that breaking that SLO will delay feature releases. The SRE discipline quantifies this acceptable risk as an "error budget." When error budgets are depleted, the focus shifts from feature development to improving reliability.
As mentioned in the second video, leadership buy-in is an important pillar in the SRE discipline. Without this cooperation, nothing prevents teams from breaking their agreed-upon SLOs, forcing SREs to work overtime or waste too much time toiling to just keep the systems running. If SRE teams do not have the ability to enforce error budgets (or if the error budgets are not taken seriously), the system fails.
Risk and error budgets quantitatively accept failure as normal and enforce the DevOps pillar to implement gradual change. Non-gradual changes risk exceeding error budgets.
4. Toil and toil budgetsAn important component of the SRE discipline is toil, toil budgets and ways to reduce toil. Toil occurs each time a human operator needs to manually touch a system during normal operations—but the definition of "normal" is constantly changing.
Toil is not simply "work I don't like to do." For example, the following tasks are overhead, but are specifically not toil: submitting expense reports, attending meetings, responding to email, commuting to work, etc. Instead, toil is specifically tied to the running of a production service. It is work that tends to be manual, repetitive, automatable, tactical and devoid of long-term value. Additionally, toil tends to scale linearly as the service grows. Each time an operator needs to touch a system, such as responding to a page, working a ticket or unsticking a process, toil has likely occurred.
The SRE discipline aims to reduce toil by focusing on the "engineering" component of Site Reliability Engineering. When SREs find tasks that can be automated, they work to engineer a solution to prevent that toil in the future. While minimizing toil is important, it's realistically impossible to completely eliminate. Google aims to ensure that at least 50% of each SRE's time is spent doing engineering projects, and these SREs individually report their toil in quarterly surveys to identify operationally overloaded teams. That being said, toil is not always bad. Predictable, repetitive tasks are great ways to onboard a new team member and often produce an immediate sense of accomplishment and satisfaction with low risk and low stress. Long-term toil assignments, however, quickly outweigh the benefits and can cause career stagnation.
Toil and toil budgets are closely related to the DevOps pillars of "measure everything" and "reduce organizational silos."
5. Customer Reliability Engineering (CRE)Finally, Customer Reliability Engineering (CRE) completes the tenets of SRE (with the help in the video of a futuristic friend). CRE aims to teach SRE practices to customers and service consumers.
In the past, Google did not talk publicly about SRE. We thought of it as a competitive advantage we had to keep secret from the world. However, every time a customer had a problem because they used a system in an unexpected way, we had to stop innovating and help solve the problem. That tiny bit of friction, spread across billions of users, adds up very quickly. It became clear that we needed to start talking about SRE publicly and teaching our customers about SRE practices so they could replicate them within their organizations.
Thus, in 2016, we launched the CRE program as both a means of helping our Google Cloud Platform (GCP) customers with improving their reliability, and a means of exposing Google SREs directly to the challenges customers face. The CRE program aims to reduce customer anxiety by teaching them SRE principles and helping them adopt SRE practices.
CRE aligns with the DevOps pillars of "reduce organization silos" by forcing collaboration across organizations, and it also closely relates to the concepts of "accepting failure as normal" and "measure everything" by creating a shared responsibility among all stakeholders in the form of shared SLOs.