Tag Archives: Developer Tools & Insights

Improving application availability with Alias IPs, now with hot standby

High availability and redundancy are essential features for a cloud deployment. On Google Cloud Platform (GCP), Alias IPs allow you to configure secondary IPs or IP ranges on your virtual machine (VM) instances, for a secure and highly scalable way to deliver traffic to your applications. Today, we’re excited to announce that you can now dynamically add and remove alias IP ranges for existing, running VMs, so that you can migrate your applications from one VM to another in the event of a software or machine failure.

In the past, you could deploy highly available applications on GCP by using static routes that point to a hosted Virtual IP (VIP), and adjusting the next-hop VM of that VIP destination based on the availability of the VM hosting the application. Now, Alias IP ranges support hot standby availability deployments, including multiple standbys for a single VM (one-to-many), as well as multiple standbys for multiple VMs (many-to-many).
With this native support, you can now rely on GCP’s IP address management capabilities to carve out flexible IP ranges for your VMs. This delivers the following benefits over high-availability solutions that use static routes:
  • Improved security: Deployments that use Alias IP let us keep anti-spoofing checks in place, which validate the source and destination IP of traffic. In contrast, static routes require that you disable anti-spoof protection for a VM, which allows traffic with any source or destination to be forwarded.
  • Connectivity through VPN / Google Cloud Interconnect: Highly available application VIPs implemented as Alias IP addresses can be announced by Cloud Router via BGP to an on-premises network connected via VPN or Cloud Interconnect. This is important if you are accessing the highly available application from your on-premises data center.
  • Native access to Google services like Google Cloud Storage, BigQuery and any other managed services from googleapis.com. By using Alias IP, highly available applications get native access to these services, avoiding bottlenecks created by an external NAT proxy.
Let’s take a look at how you can configure floating Alias IPs.

Imagine you need to configure a highly available application that requires machine state to be constantly synced, for example between a database master and slave running on VMs in your network. Using Internal Load Balancing doesn’t help here since the traffic needs to be sent to only one server. With Alias IPs, you can configure your database to run using secondary IPs on the VM's primary network interface. In the event of a failure, you can dynamically remove this IP from the failed VM and attach it to the new server.

This approach is also useful if an application in your virtual network needs to be accessed across regions, since Internal Load Balancing currently supports only in-region access.

You can use Alias IP from the gcloud command line interface.

To migrate from VM-A to VM-B
a) Remove the IP from VM-A

gcloud compute instances network-interfaces update \
     virtual-machine-a --zone us-central1-a --aliases ""
b) Add the IP to VM-B

gcloud compute instances network-interfaces update \
     virtual-machine-b --zone us-central1-a \
     --aliases "range1:"
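
If you script the failover, the two steps above could be wrapped roughly as follows. This is a minimal sketch, not an official tool: the function name failover_commands is hypothetical, it assumes both VMs are in the same zone as in the example above, and the caller supplies the alias range specification. The function only builds the commands; it does not run them.

```python
from typing import List


def failover_commands(old_vm: str, new_vm: str, zone: str, alias_spec: str) -> List[List[str]]:
    """Build the two gcloud invocations that move an alias IP range between VMs."""
    base = ["gcloud", "compute", "instances", "network-interfaces", "update"]
    return [
        # Step a: clear the alias ranges on the failed VM (empty string).
        base + [old_vm, "--zone", zone, "--aliases", ""],
        # Step b: attach the alias range to the standby VM.
        base + [new_vm, "--zone", zone, "--aliases", alias_spec],
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True), in order, so that the range is released before it is claimed by the standby.
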
In addition to adding and removing alias IPs from running VMs, you can create up to 10 Alias IP ranges per network interface, including up to seven secondary interfaces attached to other networks.
You can also use Alias IP with applications running within containers and being managed by container orchestration systems such as Kubernetes or Mesos. Click here to learn more about how Kubernetes uses Alias IPs.

Being able to migrate your workloads while they are running goes a long way toward ensuring high availability for your applications. Drop us a line about how you use Alias IPs, and other networking features you’d like to see on GCP.

Introducing ultramem Google Compute Engine machine types

Today we are excited to announce beta availability of a new family of Google Compute Engine machine types. The n1-ultramem family of memory-optimized virtual machine (VM) instances comes with more memory (a lot more!). In fact, these machine types offer more compute resources and more memory than any other VM instance that we offer, making Compute Engine a great option for a whole new range of demanding, enterprise-class workloads.

The n1-ultramem machine type allows you to provision VMs with up to 160 vCPUs and nearly 4TB of RAM. The new memory-optimized n1-ultramem family of machine types is powered by 4 Intel® Xeon® Processor E7-8880 v4 (Broadwell) CPUs and DDR4 memory, so they are ready for your most critical enterprise applications. They come in three predefined sizes:
  • n1-ultramem-40: 40 vCPUs and 961 GB of memory
  • n1-ultramem-80: 80 vCPUs and 1922 GB of memory
  • n1-ultramem-160: 160 vCPUs and 3844 GB of memory
These new machine types expand the breadth of the Compute Engine portfolio with new price-performance options. Now, you can provision compute capacity that fits your exact hardware and budget requirements, while paying only for the resources you use. These VMs are a cost-effective option for memory-intensive workloads, and provide you with the lowest $/GB of any Compute Engine machine type. For full details on machine type pricing, please check the pricing page, or the pricing calculator.

Memory-optimized machine types are well suited for enterprise workloads that require substantial vCPU and system memory, such as data analytics, enterprise resource planning, genomics, and in-memory databases. They are also ideal for many resource-hungry HPC applications.

Incorta is a cloud-based data analytics provider, and has been testing out the n1-ultramem-160 instances to run its in-memory database.
"Incorta is very excited about the performance offered by Google Cloud Platform's latest instances. With nearly 4TB of memory, these high-performance systems are ideal for Incorta's Direct Data Mapping engine which aggregates complex business data in real-time without the need to reshape any data. Using public data sources and Incorta's internal testing, we've experienced queries of three billion records in under five seconds, compared to three to seven hours with legacy systems."
— Osama Elkady, CEO, Incorta
In addition, the n1-ultramem-160 machine type, with nearly 4TB of RAM, is a great fit for the SAP HANA in-memory database. If you’ve delayed moving to the cloud because you have not been able to find big enough instances for your SAP HANA implementation, take a look at Compute Engine. Now you don’t need to keep your database on-premises while your apps move to cloud. You can run both your application and in-memory database in Google Cloud Platform where SAP HANA backend applications will benefit from the ultra-low latency of running alongside the in-memory database.

You can currently launch ultramem VMs in us-central1, us-east1 and europe-west1. Stay up-to-date on additional regions by visiting our available regions and zones page.

Visit the Google Cloud Platform Console and get started today. It’s easy to configure and provision n1-ultramem machine types programmatically, as well as via the console. If you’d like to learn more about running your SAP HANA in-memory database on GCP with ultramem machine types, visit our SAP page.

Increase performance while reducing costs with the new App Engine scheduler

One of the main benefits of Google App Engine is automatic scaling of your applications. Behind the scenes, App Engine continually monitors your instance capacity and traffic to ensure the appropriate number of instances are running. Today, we are rolling out the next generation scheduler for App Engine standard environment. Our tests show that it delivers better scaling performance and more efficient resource consumption—and lower costs for you.

The new App Engine scheduler delivers the following improvements compared to the previous App Engine scheduler:

  • an average of 5% reduction in median and tail request latencies
  • an average of 30% reduction of the number of requests seeing a "cold start"
  • an average of 7% cost reduction

Observed improvements across all App Engine services and customers: blue is the baseline (old scheduler), green is the new scheduler.

In addition, if you need more control over how App Engine runs your applications, the new scheduler introduces some new autoscaling parameters. For example:

  • Max Instances allows you to cap the total number of instances, and
  • Target CPU Utilization represents the CPU utilization ratio threshold used to determine if the number of instances should be scaled up or down. Tweak this parameter to optimize between performance and costs.

For a complete list of the parameters you can use to configure your App Engine app, visit the app.yaml reference documentation.

The new scheduler for App Engine standard environment is generally available and has been rolled out to all regions and all applications. We are very excited about the improvements it brings.

You can read more about the new feature in the App Engine documentation. And if you have concerns or are encountering issues, reach out to us via GCP Support, by reporting a public issue, posting in the App Engine forum, or messaging us on the App Engine slack channel. We look forward to your feedback!

Kubernetes best practices: Resource requests and limits

Editor’s note: Today is the fourth installment in a seven-part video and blog series from Google Developer Advocate Sandeep Dinesh on how to get the most out of your Kubernetes environment.

When Kubernetes schedules a Pod, it’s important that the containers have enough resources to actually run. If you schedule a large application on a node with limited resources, it is possible for the node to run out of memory or CPU resources and for things to stop working!

It’s also possible for applications to take up more resources than they should. This could be caused by anything from a team spinning up more replicas than they need to artificially decrease latency (hey, it’s easier to spin up more copies than make your code more efficient!), to a bad configuration change that causes a program to go out of control and use 100% of the available CPU. Regardless of whether the issue is caused by a bad developer, bad code, or bad luck, what’s important is that you be in control.

In this episode of Kubernetes best practices, let’s take a look at how you can solve these problems using resource requests and limits.

Requests and Limits

Requests and limits are the mechanisms Kubernetes uses to control resources such as CPU and memory. Requests are what the container is guaranteed to get. If a container requests a resource, Kubernetes will only schedule it on a node that can give it that resource. Limits, on the other hand, make sure a container never goes above a certain value. The container is only allowed to go up to the limit, and then it is restricted.

It is important to remember that the limit can never be lower than the request. If you try this, Kubernetes will throw an error and won’t let you run the container.

Requests and limits are on a per-container basis. While Pods usually contain a single container, it’s common to see Pods with multiple containers as well. Each container in the Pod gets its own individual limit and request, but because Pods are always scheduled as a group, you need to add the limits and requests for each container together to get an aggregate value for the Pod.

To control what requests and limits a container can have, you can set quotas at the Container level and at the Namespace level. If you want to learn more about Namespaces, see this previous installment from our blog series!

Let’s see how these work.

Container settings

There are two types of resources: CPU and Memory. The Kubernetes scheduler uses these to figure out where to run your pods.

Here are the docs for these resources.

If you are running in Google Kubernetes Engine, the default Namespace already has some requests and limits set up for you.

These default settings are okay for “Hello World”, but it is important to change them to fit your app.

A typical Pod spec for resources might look something like this. This pod has two containers:

Each container in the Pod can set its own requests and limits, and these are all additive. So in the above example, the Pod has a total request of 500 mCPU and 128 MiB of memory, and a total limit of 1 CPU and 256MiB of memory.
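
The additive behavior can be checked with a few lines of Python. This is a sketch under assumptions: the two containers and their per-container values are hypothetical, chosen only so that they add up to the totals quoted above.

```python
# Hypothetical two-container Pod; per-container values are assumptions
# picked to match the totals quoted in the text.
pod_containers = [
    {"name": "app",
     "requests": {"cpu_m": 250, "memory_mib": 64},
     "limits": {"cpu_m": 500, "memory_mib": 128}},
    {"name": "sidecar",
     "requests": {"cpu_m": 250, "memory_mib": 64},
     "limits": {"cpu_m": 500, "memory_mib": 128}},
]


def pod_total(containers, kind):
    """Sum the 'requests' or 'limits' of every container in the Pod."""
    total = {"cpu_m": 0, "memory_mib": 0}
    for container in containers:
        for resource, amount in container[kind].items():
            total[resource] += amount
    return total


print(pod_total(pod_containers, "requests"))  # {'cpu_m': 500, 'memory_mib': 128}
print(pod_total(pod_containers, "limits"))    # {'cpu_m': 1000, 'memory_mib': 256}
```
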


CPU resources are defined in millicores. If your container needs two full cores to run, you would put the value “2000m”. If your container only needs ¼ of a core, you would put a value of “250m”.
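
To make the unit concrete, here is a small illustrative converter from a CPU quantity string to millicores. The helper name cpu_to_millicores is hypothetical and the parsing is deliberately simplified; it is not the parser Kubernetes itself uses.

```python
def cpu_to_millicores(value: str) -> int:
    """Convert a CPU quantity such as "250m" or "2" into millicores."""
    if value.endswith("m"):
        # Already expressed in millicores, e.g. "250m".
        return int(value[:-1])
    # Whole or fractional cores, e.g. "2" or "0.25".
    return round(float(value) * 1000)


print(cpu_to_millicores("2000m"))  # 2000
print(cpu_to_millicores("2"))      # 2000
print(cpu_to_millicores("0.25"))   # 250
```
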

One thing to keep in mind about CPU requests is that if you put in a value larger than the core count of your biggest node, your pod will never be scheduled. Let’s say you have a pod that needs four cores, but your Kubernetes cluster is comprised of dual core VMs—your pod will never be scheduled!

Unless your app is specifically designed to take advantage of multiple cores (scientific computing and some databases come to mind), it is usually a best practice to keep the CPU request at ‘1’ or below, and run more replicas to scale it out. This gives the system more flexibility and reliability.

It’s when it comes to CPU limits that things get interesting. CPU is considered a “compressible” resource. If your app starts hitting your CPU limits, Kubernetes starts throttling your container. This means the CPU will be artificially restricted, giving your app potentially worse performance! However, it won’t be terminated or evicted. You can use a liveness health check to make sure performance has not been impacted.


Memory resources are defined in bytes. Normally, you give a mebibyte value for memory (this is basically the same thing as a megabyte), but you can give anything from bytes to petabytes.
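
The difference between the binary suffixes (Ki, Mi, Gi, ...) and the decimal ones (k, M, G, ...) is easy to see in code. The following is a simplified sketch with a hypothetical helper name; it handles only whole-number quantities and is not the Kubernetes parser.

```python
_SUFFIXES = {
    # Binary (power-of-two) suffixes.
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50,
    # Decimal (power-of-ten) suffixes.
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15,
}


def memory_to_bytes(value: str) -> int:
    """Convert a memory quantity such as "64Mi" or "64M" into bytes."""
    # Try two-letter suffixes before one-letter ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * _SUFFIXES[suffix]
    return int(value)  # a plain number means bytes


print(memory_to_bytes("64Mi"))  # 67108864
print(memory_to_bytes("64M"))   # 64000000
```

A mebibyte is about 5% larger than a megabyte, which is why the two are close enough to be treated as interchangeable for rough sizing.
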

Just like CPU, if you put in a memory request that is larger than the amount of memory on your nodes, the pod will never be scheduled.

Unlike CPU resources, memory cannot be compressed. Because there is no way to throttle memory usage, if a container goes past its memory limit it will be terminated. If your pod is managed by a Deployment, StatefulSet, DaemonSet, or another type of controller, then the controller spins up a replacement.


It is important to remember that you cannot set requests that are larger than resources provided by your nodes. For example, if you have a cluster of dual-core machines, a Pod with a request of 2.5 cores will never be scheduled! You can find the total resources for Kubernetes Engine VMs here.

Namespace settings

In an ideal world, Kubernetes’ Container settings would be good enough to take care of everything, but the world is a dark and terrible place. People can easily forget to set the resources, or a rogue team can set the requests and limits very high and take up more than their fair share of the cluster.

To prevent these scenarios, you can set up ResourceQuotas and LimitRanges at the Namespace level.


After creating Namespaces, you can lock them down using ResourceQuotas. ResourceQuotas are very powerful, but let’s just look at how you can use them to restrict CPU and Memory resource usage.

A Quota for resources might look something like this:

Looking at this example, you can see there are four sections. Configuring each of these sections is optional.

requests.cpu is the maximum combined CPU requests in millicores for all the containers in the Namespace. In the above example, you can have 50 containers with 10m requests, five containers with 100m requests, or even one container with a 500m request, as long as the total requested CPU in the Namespace does not exceed 500m!

requests.memory is the maximum combined Memory requests for all the containers in the Namespace. In the above example, you can have 50 containers with 2MiB requests, five containers with 20MiB requests, or even a single container with a 100MiB request, as long as the total requested Memory in the Namespace does not exceed 100MiB!

limits.cpu is the maximum combined CPU limits for all the containers in the Namespace. It’s just like requests.cpu but for the limit.

limits.memory is the maximum combined Memory limits for all containers in the Namespace. It’s just like requests.memory but for the limit.
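
The request side of such a quota can be modeled in a few lines. This is an illustrative sketch only: the 500m and 100MiB values are the ones quoted in the text, while the dictionary keys and the function name fits_quota are hypothetical.

```python
# Request quotas quoted in the text: 500m CPU and 100MiB of memory.
QUOTA = {"requests.cpu_m": 500, "requests.memory_mib": 100}


def fits_quota(containers, quota=QUOTA):
    """Return True if the combined requests of all containers stay within the quota."""
    cpu = sum(c["cpu_m"] for c in containers)
    mem = sum(c["memory_mib"] for c in containers)
    return cpu <= quota["requests.cpu_m"] and mem <= quota["requests.memory_mib"]


fifty_small = [{"cpu_m": 10, "memory_mib": 2}] * 50
print(fits_quota(fifty_small))  # True: 500m CPU and 100MiB in total
print(fits_quota(fifty_small + [{"cpu_m": 10, "memory_mib": 2}]))  # False: one container too many
```
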

If you are using a production and development Namespace (in contrast to a Namespace per team or service), a common pattern is to put no quota on the production Namespace and strict quotas on the development Namespace. This allows production to take all the resources it needs in case of a spike in traffic.


You can also create a LimitRange in your Namespace. Unlike a Quota, which looks at the Namespace as a whole, a LimitRange applies to an individual container. This can help prevent people from creating super tiny or super large containers inside the Namespace.

A LimitRange might look something like this:

Looking at this example, you can see there are four sections. Again, setting each of these sections is optional.

The default section sets up the default limits for a container in a pod. If you set these values in the LimitRange, any containers that don’t explicitly set these themselves will get assigned the default values.

The defaultRequest section sets up the default requests for a container in a pod. If you set these values in the LimitRange, any containers that don’t explicitly set these themselves will get assigned the default values.

The max section will set up the maximum limits that a container in a Pod can set. The default section cannot be higher than this value. Likewise, limits set on a container cannot be higher than this value. It is important to note that if this value is set and the default section is not, any containers that don’t explicitly set these values themselves will get assigned the max values as the limit.

The min section sets up the minimum Requests that a container in a Pod can set. The defaultRequest section cannot be lower than this value. Likewise, requests set on a container cannot be lower than this value either. It is important to note that if this value is set and the defaultRequest section is not, the min value becomes the defaultRequest value too.
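
The interaction between the four sections, as described above, can be sketched for a single resource (CPU in millicores). This models the prose in this post rather than the actual Kubernetes admission code; the function name and the example values used with it are hypothetical.

```python
def apply_limit_range(container, limit_range):
    """Fill in a missing CPU request/limit and validate against min/max."""
    # Fallbacks as described in the text: max stands in for a missing
    # default, and min stands in for a missing defaultRequest.
    default_limit = limit_range.get("default", limit_range.get("max"))
    default_request = limit_range.get("defaultRequest", limit_range.get("min"))

    limit = container.get("limit", default_limit)
    request = container.get("request", default_request)

    if "max" in limit_range and limit is not None and limit > limit_range["max"]:
        raise ValueError("limit exceeds the LimitRange max")
    if "min" in limit_range and request is not None and request < limit_range["min"]:
        raise ValueError("request is below the LimitRange min")
    if limit is not None and request is not None and limit < request:
        raise ValueError("limit is lower than request")
    return {"request": request, "limit": limit}


# Only min and max are set, so they double as the defaults.
print(apply_limit_range({}, {"min": 200, "max": 800}))  # {'request': 200, 'limit': 800}
```
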

The lifecycle of a Kubernetes Pod

At the end of the day, these resource requests are used by the Kubernetes scheduler to run your workloads. It is important to understand how this works so you can tune your containers correctly.

Let’s say you want to run a Pod on your Cluster. Assuming the Pod specifications are valid, the Kubernetes scheduler will use round-robin load balancing to pick a Node to run your workload.

Note: The exception to this is if you use a nodeSelector or similar mechanism to force Kubernetes to schedule your Pod in a specific place. The resource checks still occur when you use a nodeSelector, but Kubernetes will only check nodes that have the required label.

Kubernetes then checks to see if the Node has enough resources to fulfill the resource requests on the Pod’s containers. If it doesn’t, it moves on to the next node.

If none of the Nodes in the system have resources left to fill the requests, then Pods go into a “pending” state. By using Kubernetes Engine features such as the Node Autoscaler, Kubernetes Engine can automatically detect this state and create more Nodes automatically. If there is excess capacity, the autoscaler can also scale down and remove Nodes to save you money!
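
The resource check described above can be pictured with a toy model. This is a simplified sketch of the prose, not the real scheduler: the field names and the function pick_node are hypothetical, and it considers only CPU and memory requests.

```python
def pick_node(pod_request, nodes):
    """Return the first node with room for the Pod, or None if it stays pending."""
    for node in nodes:
        free_cpu = node["allocatable_cpu_m"] - node["requested_cpu_m"]
        free_mem = node["allocatable_mem_mib"] - node["requested_mem_mib"]
        if pod_request["cpu_m"] <= free_cpu and pod_request["mem_mib"] <= free_mem:
            return node["name"]
    # No node can satisfy the requests: the Pod is left pending.
    return None


nodes = [
    {"name": "node-1", "allocatable_cpu_m": 2000, "requested_cpu_m": 1900,
     "allocatable_mem_mib": 4096, "requested_mem_mib": 1024},
    {"name": "node-2", "allocatable_cpu_m": 2000, "requested_cpu_m": 500,
     "allocatable_mem_mib": 4096, "requested_mem_mib": 1024},
]
print(pick_node({"cpu_m": 500, "mem_mib": 256}, nodes))   # node-2
print(pick_node({"cpu_m": 4000, "mem_mib": 256}, nodes))  # None
```

Note that only requests enter this check; limits play no part in deciding where a Pod fits.
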

But what about limits? As you know, limits can be higher than the requests. What if you have a Node where the sum of all the container Limits is actually higher than the resources available on the machine?

At this point, Kubernetes goes into something called an “overcommitted state.” Here is where things get interesting. Because CPU can be compressed, Kubernetes will make sure your containers get the CPU they requested and will throttle the rest. Memory cannot be compressed, so Kubernetes needs to start making decisions on what containers to terminate if the Node runs out of memory.

Let’s imagine a scenario where we have a machine that is running out of memory. What will Kubernetes do?

Note: The following is true for Kubernetes 1.9 and above. In previous versions, it uses a slightly different process. See this doc for an in-depth explanation.

Kubernetes looks for Pods that are using more resources than they requested. If your Pod’s containers have no requests, then by default they are using more than they requested, so these are prime candidates for termination. Other prime candidates are containers that have gone over their request but are still under their limit.

If Kubernetes finds multiple pods that have gone over their requests, it will then rank these by the Pod’s priority, and terminate the lowest priority pods first. If all the Pods have the same priority, Kubernetes terminates the Pod that’s the most over its request.
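
The ranking described in the last two paragraphs can be expressed as a sort key. This is a minimal sketch of the prose, with hypothetical field names and memory figures; a Pod with no request is modeled as request_mib = 0.

```python
def eviction_order(pods):
    """Order Pods from first to last candidate for termination under memory pressure."""
    def key(pod):
        over = pod["usage_mib"] - pod["request_mib"]
        return (
            0 if over > 0 else 1,  # Pods over their request come first
            pod["priority"],       # then the lowest priority
            -over,                 # then the Pod furthest over its request
        )
    return [pod["name"] for pod in sorted(pods, key=key)]


pods = [
    {"name": "a", "usage_mib": 300, "request_mib": 100, "priority": 10},
    {"name": "b", "usage_mib": 150, "request_mib": 100, "priority": 0},
    {"name": "c", "usage_mib": 50, "request_mib": 100, "priority": 0},
    {"name": "d", "usage_mib": 500, "request_mib": 100, "priority": 10},
]
print(eviction_order(pods))  # ['b', 'd', 'a', 'c']
```
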

In very rare scenarios, Kubernetes might be forced to terminate Pods that are still within their requests. This can happen when critical system components, like the kubelet or Docker, start taking more resources than were reserved for them.


While your Kubernetes cluster might work fine without setting resource requests and limits, you will start running into stability issues as your teams and projects grow. Adding requests and limits to your Pods and Namespaces only takes a little extra effort, and can save you from running into many headaches down the line!

Using Jenkins on Google Compute Engine for distributed builds

Continuous integration has become a standard practice across a lot of software development organizations, automatically detecting changes that were committed to your software repositories, running them through unit, integration and functional tests, and finally creating an artifact (JAR, Docker image, or binary). Among continuous integration tools, Jenkins is one of the most popular, and so we created the Compute Engine Plugin, helping you to provision, configure and scale Jenkins build environments on Google Cloud Platform (GCP).

With Jenkins, you define your build and test process, then run it continuously against your latest software changes. But as you scale up your continuous integration practice, you may need to run builds across fleets of machines rather than on a single server. With the Compute Engine Plugin, your DevOps teams can intuitively manage instance templates and launch build instances that automatically register themselves with Jenkins. When Jenkins needs to run jobs but there aren’t enough available nodes, it provisions instances on-demand based on your templates. Once work in the build system has slowed down, the plugin automatically deletes your unused instances, so that you only pay for the instances you need. This autoscaling functionality is an important feature of a continuous build system, which gets a lot of use during primary work hours, and less when developers are off enjoying themselves. For further cost savings, you can also configure the Compute Engine Plugin to create your build instances as Preemptible VMs, which can save you up to 80% on per-second pricing of your builds.

Security is another concern with continuous integration systems. A compromise of this key organizational system can put the integrity of your software at risk. The Compute Engine Plugin uses the latest and most secure version of the Jenkins Java Network Launch Protocol (JNLP) remoting protocol. When bootstrapping the build instances, the Compute Engine Plugin creates a one-time SSH key and injects it into each build instance. That way, the impact of those credentials being compromised is limited to a single instance.

The Compute Engine Plugin lets you configure your build instances how you like them, including the networking. For example, you can:

  • Disable external IPs so that worker VMs are not publicly accessible
  • Use Shared VPC networks for greater isolation in your GCP projects
  • Apply custom network tags for improved placement in firewall rules

The plugin also allows you to attach accelerators like GPUs and Local SSDs to your instances to run your builds faster. You can also configure the plugin to use our wide variety of machine types which match the CPU and memory requirements of your build instance to the workload, for better utilization. Finally, the plugin allows you to configure arbitrary startup scripts for your instance templates, where you can do the final configuration of your base images before your builds are run.

If you use Jenkins on-premises, you can use the Compute Engine Plugin to create an ephemeral build farm in Compute Engine while keeping your Jenkins master and other necessary build dependencies behind your firewall. You can then use this extension of your build farm when you can’t meet demand for build capacity, or as a way to transition your workloads to the cloud in a practical and low-risk way.

Here is an example of the configuration page for an instance template:

Below is a high-level architecture of a scalable build system built with the Jenkins Compute Engine and Google Cloud Storage plugins. The Jenkins administrator configures an IAM service account that Jenkins uses to provision your build instances. Once builds are run, it can upload artifacts to Cloud Storage to archive them (and move them to cheaper storage after a given time threshold).

Jenkins and continuous integration are powerful tools for modern software development shops, and we hope this plugin makes it easier for you to use Jenkins on GCP. For instructions on getting this set up in your Google Cloud project, follow our solution guide.

SRE vs. DevOps: competing standards or close friends?

Site Reliability Engineering (SRE) and DevOps are two trending disciplines with quite a bit of overlap. In the past, some have called SRE a competing set of practices to DevOps. But we think they're not so different after all.

What exactly is SRE and how does it relate to DevOps? Earlier this year, we (Liz Fong-Jones and Seth Vargo) launched a video series to help answer some of these questions and reduce the friction between the communities. This blog post summarizes the themes and lessons of each video in the series to offer actionable steps toward better, more reliable systems.

1. The difference between DevOps and SRE

It’s useful to start by understanding the differences and similarities between SRE and DevOps to lay the groundwork for future conversation.

The DevOps movement began because developers would write code with little understanding of how it would run in production. They would throw this code over the proverbial wall to the operations team, which would be responsible for keeping the applications up and running. This often resulted in tension between the two groups, as each group's priorities were misaligned with the needs of the business. DevOps emerged as a culture and a set of practices that aims to reduce the gaps between software development and software operation. However, the DevOps movement does not explicitly define how to succeed in these areas. In this way, DevOps is like an abstract class or interface in programming. It defines the overall behavior of the system, but the implementation details are left up to the author.

SRE, which evolved at Google to meet internal needs in the early 2000s independently of the DevOps movement, happens to embody the philosophies of DevOps, but has a much more prescriptive way of measuring and achieving reliability through engineering and operations work. In other words, SRE prescribes how to succeed in the various DevOps areas. For example, the table below illustrates the five DevOps pillars and the corresponding SRE practices:

DevOps | SRE
Reduce organization silos | Share ownership with developers by using the same tools and techniques across the stack
Accept failure as normal | Have a formula for balancing accidents and failures against new releases
Implement gradual change | Encourage moving quickly by reducing costs of failure
Leverage tooling & automation | Encourage "automating this year's job away" and minimizing manual systems work to focus on efforts that bring long-term value to the system
Measure everything | Believe that operations is a software problem, and define prescriptive ways for measuring availability, uptime, outages, toil, etc.

If you think of DevOps like an interface in a programming language, class SRE implements DevOps. While the SRE program did not explicitly set out to satisfy the DevOps interface, both disciplines independently arrived at a similar set of conclusions. But just like in programming, classes often include more behavior than just what their interface defines, or they might implement multiple interfaces. SRE includes additional practices and recommendations that are not necessarily part of the DevOps interface.
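
Taken literally, the analogy might look like the following Python sketch. It is purely illustrative: the method names are made up from the five pillars in the table, and the return strings paraphrase the SRE column.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The five DevOps pillars as an interface: what to do, not how."""

    @abstractmethod
    def reduce_organizational_silos(self) -> str: ...

    @abstractmethod
    def accept_failure_as_normal(self) -> str: ...

    @abstractmethod
    def implement_gradual_change(self) -> str: ...

    @abstractmethod
    def leverage_tooling_and_automation(self) -> str: ...

    @abstractmethod
    def measure_everything(self) -> str: ...


class SRE(DevOps):
    """One concrete, prescriptive implementation of the interface."""

    def reduce_organizational_silos(self) -> str:
        return "share ownership with developers"

    def accept_failure_as_normal(self) -> str:
        return "balance accidents and failures against new releases"

    def implement_gradual_change(self) -> str:
        return "move quickly by reducing the cost of failure"

    def leverage_tooling_and_automation(self) -> str:
        return "automate this year's job away"

    def measure_everything(self) -> str:
        return "measure availability, uptime, outages and toil"


print(isinstance(SRE(), DevOps))  # True
```

Trying to instantiate DevOps directly raises a TypeError, which mirrors the point that DevOps on its own leaves the implementation details open.
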

DevOps and SRE are not two competing methods for software development and operations, but rather close friends designed to break down organizational barriers to deliver better software faster. If you prefer books, check out How SRE relates to DevOps (Betsy Beyer, Niall Richard Murphy, Liz Fong-Jones) for a more thorough explanation.

2. SLIs, SLOs, and SLAs

The SRE discipline collaboratively decides on a system's availability targets and measures availability with input from engineers, product owners and customers.

It can be challenging to have a productive conversation about software development without a consistent and agreed-upon way to describe a system's uptime and availability. Operations teams are constantly putting out fires, some of which end up being bugs in developers' code. But without a clear measurement of uptime and a clear prioritization on availability, product teams may not agree that reliability is a problem. This very challenge affected Google in the early 2000s, and it was one of the motivating factors for developing the SRE discipline.

SRE ensures that everyone agrees on how to measure availability, and what to do when availability falls out of specification. This process includes individual contributors at every level, all the way up to VPs and executives, and it creates a shared responsibility for availability across the organization. SREs work with stakeholders to decide on Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

  • SLIs are metrics over time such as request latency, throughput of requests per second, or failures per request. These are usually aggregated over time and then converted to a rate, average or percentile subject to a threshold.
  • SLOs are targets for the cumulative success of SLIs over a window of time (like "last 30 days" or "this quarter"), agreed-upon by stakeholders.
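
As a rough illustration of how the two relate, the sketch below computes a success-ratio SLI from request counts and compares it with an SLO target. The function names, the request counts and the 99.9% target are all hypothetical.

```python
def availability_sli(successful: int, total: int) -> float:
    """Fraction of requests in the window that succeeded."""
    # With no traffic there is nothing to fail, so report full availability.
    return successful / total if total else 1.0


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli >= slo_target


sli = availability_sli(successful=999_412, total=1_000_000)
print(meets_slo(sli, 0.999))  # True: 99.9412% is above the 99.9% target
```
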

The video also discusses Service Level Agreements (SLAs). Although not specifically part of the day-to-day concerns of SREs, an SLA is a promise by a service provider, to a service consumer, about the availability of a service and the ramifications of failing to deliver the agreed-upon level of service. SLAs are usually defined and negotiated by account executives for customers and offer a lower availability than the SLO. After all, you want to break your own internal SLO before you break a customer-facing SLA.

SLIs, SLOs and SLAs tie back closely to the DevOps pillar of "measure everything" and are one of the reasons we say class SRE implements DevOps.

3. Risk and error budgets

We focus here on measuring risk through error budgets, which are quantitative ways in which SREs collaborate with product owners to balance availability and feature development. This video also discusses why 100% is not a viable availability target.

Maximizing a system's stability is both counterproductive and pointless. Unrealistic reliability targets limit how quickly new features can be delivered to users, and users typically won't notice extreme availability (like 99.999999%) because the quality of their experience is dominated by less reliable components like ISPs, cellular networks or WiFi. Having a 100% availability requirement severely limits a team or developer’s ability to deliver updates and improvements to a system. Service owners who want to deliver many new features should opt for less stringent SLOs, thereby giving them the freedom to continue shipping in the event of a bug. Service owners focused on reliability can choose a higher SLO, but accept that breaking that SLO will delay feature releases. The SRE discipline quantifies this acceptable risk as an "error budget." When error budgets are depleted, the focus shifts from feature development to improving reliability.
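
For a time-based availability SLO, the error budget is simply the part of the window that is allowed to be bad. The sketch below works this out; the function names, the 99.9% target and the 30-day window are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of unavailability the SLO tolerates over the window."""
    return (1 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Budget left after the downtime already incurred; negative means it is spent."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


print(round(error_budget_minutes(0.999, 30), 1))  # 43.2
print(budget_remaining(0.999, 30, downtime_minutes=50) < 0)  # True: budget exhausted
```

Once the remaining budget goes negative, the policy described above applies: feature releases pause and the focus shifts to reliability. With a 100% target the budget is zero, so any change that carries risk would breach it.
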

As mentioned in the second video, leadership buy-in is an important pillar in the SRE discipline. Without this cooperation, nothing prevents teams from breaking their agreed-upon SLOs, forcing SREs to work overtime or waste too much time toiling to just keep the systems running. If SRE teams do not have the ability to enforce error budgets (or if the error budgets are not taken seriously), the system fails.

Risk and error budgets quantitatively accept failure as normal and enforce the DevOps pillar to implement gradual change. Non-gradual changes risk exceeding error budgets.

4. Toil and toil budgets

An important component of the SRE discipline is toil, toil budgets and ways to reduce toil. Toil occurs each time a human operator needs to manually touch a system during normal operations—but the definition of "normal" is constantly changing.

Toil is not simply "work I don't like to do." For example, the following tasks are overhead, but are specifically not toil: submitting expense reports, attending meetings, responding to email, commuting to work, etc. Instead, toil is specifically tied to the running of a production service. It is work that tends to be manual, repetitive, automatable, tactical and devoid of long-term value. Additionally, toil tends to scale linearly as the service grows. Each time an operator needs to touch a system, such as responding to a page, working a ticket or unsticking a process, toil has likely occurred.

The SRE discipline aims to reduce toil by focusing on the "engineering" component of Site Reliability Engineering. When SREs find tasks that can be automated, they work to engineer a solution to prevent that toil in the future. While minimizing toil is important, it's realistically impossible to completely eliminate. Google aims to ensure that at least 50% of each SRE's time is spent doing engineering projects, and these SREs individually report their toil in quarterly surveys to identify operationally overloaded teams. That being said, toil is not always bad. Predictable, repetitive tasks are great ways to onboard a new team member and often produce an immediate sense of accomplishment and satisfaction with low risk and low stress. Long-term toil assignments, however, quickly outweigh the benefits and can cause career stagnation.

Toil and toil budgets are closely related to the DevOps pillars of "measure everything" and "reduce organizational silos."

5. Customer Reliability Engineering (CRE)

Finally, Customer Reliability Engineering (CRE) completes the tenets of SRE (with the help in the video of a futuristic friend). CRE aims to teach SRE practices to customers and service consumers.

In the past, Google did not talk publicly about SRE. We thought of it as a competitive advantage we had to keep secret from the world. However, every time a customer had a problem because they used a system in an unexpected way, we had to stop innovating and help solve the problem. That tiny bit of friction, spread across billions of users, adds up very quickly. It became clear that we needed to start talking about SRE publicly and teaching our customers about SRE practices so they could replicate them within their organizations.

Thus, in 2016, we launched the CRE program as both a means of helping our Google Cloud Platform (GCP) customers with improving their reliability, and a means of exposing Google SREs directly to the challenges customers face. The CRE program aims to reduce customer anxiety by teaching them SRE principles and helping them adopt SRE practices.

CRE aligns with the DevOps pillar of "reduce organizational silos" by forcing collaboration across organizations, and it also closely relates to the concepts of "accepting failure as normal" and "measure everything" by creating a shared responsibility among all stakeholders in the form of shared SLOs.

Looking forward with SRE

We are working on some exciting new content across a variety of mediums to help showcase how users can adopt DevOps and SRE on Google Cloud, and we cannot wait to share it with you. What SRE topics are you interested in hearing about? Please give us a tweet or watch our videos.

Defining SLOs for services with dependencies – CRE life lessons

In a previous episode of CRE Life Lessons, we discussed how service level objectives (SLOs) are an important tool for defining and measuring the reliability of your service. There’s also a whole chapter in the SRE book about this topic. In this episode, we discuss how to define and manage SLOs for services with dependencies, each of which may (or may not!) have their own SLOs.

Any non-trivial service has dependencies. Some dependencies are direct: service A makes a Remote Procedure Call to service B, so A depends on B. Others are indirect: if B in turn depends on C and D, then A also depends on C and D, in addition to B. Still others are structurally implicit: a service may run in a particular Google Cloud Platform (GCP) zone or region, or depend on DNS or some other form of service discovery.

To make things more complicated, not all dependencies have the same impact. Outages for "hard" dependencies imply that your service is out as well. Outages for "soft" dependencies should have no impact on your service if they were designed appropriately. A common example is best-effort logging/tracing to an external monitoring system. Other dependencies are somewhere in between; for example, a failure in a caching layer might result in degraded latency performance, which may or may not be out of SLO.

Take a moment to think about one of your services. Do you have a list of its dependencies, and what impact they have? Do the dependencies have SLOs that cover your specific needs?

Given all this, how can you as a service owner define SLOs and be confident about meeting them? Consider the following complexities:

  • Some of your dependencies may not even have SLOs, or their SLOs may not capture how you're using them.
  • The effect of a dependency's SLO on your service isn't always straightforward. In addition to the "hard" vs "soft" vs "degraded" impact discussed above, your code may complicate the effect of a dependency's SLOs on your service. For example, you have a 10s timeout on an RPC, but its SLO is based on serving a response within 30s. Or, your code does retries, and its impact on your service depends on the effectiveness of those retries (e.g., if the dependency fails 0.1% of all requests, does your retry have a 0.1% chance of failing or is there something about your request that means it is more than 0.1% likely to fail again?).
  • How to combine SLOs of multiple dependencies depends on the correlation between them. At the extremes, if all of your dependencies are always unavailable at the same time, then theoretically your unavailability is based on the max(), i.e., the dependency with the longest unavailability. If they are unavailable at distinct times, then theoretically your unavailability is the sum() of the unavailability of each dependency. The reality is likely somewhere in between.
  • Services usually do better than their SLOs (and usually much better than their service level agreements), so using them to estimate your downtime is often too conservative.
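
To make the two extremes from the list above concrete, here is a minimal Python sketch. The function names and sample numbers are illustrative assumptions. The retry helper also assumes that every attempt fails independently, which, as noted above, may not be true for your requests.

```python
def unavailability_bounds(dep_unavailability):
    """Return (best_case, worst_case) unavailability caused by hard dependencies.

    best_case:  outages fully overlap, so only the worst dependency counts (max).
    worst_case: outages never overlap, so they add up (sum), capped at 100%.
    Each input is a fraction of time, e.g. 0.001 for 99.9% availability.
    """
    values = list(dep_unavailability)
    if not values:
        return 0.0, 0.0
    return max(values), min(1.0, sum(values))


def failure_after_retries(p_fail: float, retries: int) -> float:
    """Chance that a request still fails after `retries` extra attempts,
    assuming (optimistically) that attempts fail independently."""
    return p_fail ** (retries + 1)


# Three hypothetical hard dependencies at 99.9%, 99.95% and 99.99%.
low, high = unavailability_bounds([0.001, 0.0005, 0.0001])
print(f"unavailability between {low:.4%} and {high:.4%}")
print(f"failure rate with one retry: {failure_after_retries(0.001, 1):.6%}")
```

The real figure usually lies somewhere between the two bounds, and correlated retry failures push the retry result back toward the single-attempt failure rate.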
At this point you may want to throw up your hands and give up on determining an achievable SLO for your service entirely. Don't despair! The way out of this thorny mess is to go back to the basics of how to define a good SLO. Instead of determining your SLO bottom-up ("What can my service achieve based on all of my dependencies?"), go top down: "What SLO do my customers need to be happy?" Use that as your SLO.

Risky business

You may find that you can consistently meet that SLO with the availability you get from your dependencies (minus your own home-grown sources of unavailability). Great! Your users are happy. If not, you have some work to do. Either way, the top-down approach of setting your SLO doesn't mean you should ignore the risks that dependencies pose to it. CRE tech lead Matt Brown gave a great talk at SRECon18 Americas about prioritizing risk (slides), including a risk analysis spreadsheet that you can use to help identify, communicate, and prioritize the top risks to your error budget (the talk expands on a previous CRE Life Lessons blog post).

Some of the main sources of risk to your SLO will of course come from your dependencies. When modeling the risk from a dependency, you can use its published SLO, or choose to use observed/historical performance instead: SLOs tend to be conservative, so using them will likely overestimate the actual risk. In some cases, if a dependency doesn't have a published SLO and you don't have historical data, you'll have to use your best guess. When modeling risk, also keep in mind the difficulties described above about mapping a dependency's SLO onto yours. If you're using the spreadsheet, you can try out different values (for example, the published SLO for a dependency versus the observed performance) and see the effect they have on your projected SLO performance.1

Remember that you're making these estimates as a tool for prioritization; they don't have to be perfectly accurate, and your estimates won't result in any guarantees. However, the process should give you a better understanding of whether you're likely to consistently meet your SLO, and if not, what the biggest sources of risk to your error budget are. It also encourages you to document your assumptions, where they can be discussed and critiqued. From there, you can do a pragmatic cost/benefit analysis to decide which risks to mitigate.
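
As a rough illustration of this kind of prioritization, the Python sketch below ranks risks by how much of a yearly error budget each is expected to burn. The model (incidents per year times minutes per incident times fraction of users affected) is a simplifying assumption made here, not necessarily the formula used in the spreadsheet mentioned above, and every risk name and number is hypothetical.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60


@dataclass
class Risk:
    name: str
    incidents_per_year: float    # estimated frequency
    minutes_per_incident: float  # estimated time to detect plus time to repair
    fraction_of_users: float     # share of users affected, 0..1

    @property
    def bad_minutes_per_year(self) -> float:
        return self.incidents_per_year * self.minutes_per_incident * self.fraction_of_users


def rank_risks(risks, slo):
    """Return (name, bad minutes per year, share of yearly budget), largest first."""
    budget = (1.0 - slo) * MINUTES_PER_YEAR
    ranked = sorted(risks, key=lambda r: r.bad_minutes_per_year, reverse=True)
    return [(r.name, r.bad_minutes_per_year, r.bad_minutes_per_year / budget) for r in ranked]


# Hypothetical estimates for a service with a 99.9% SLO.
risks = [
    Risk("bad release", incidents_per_year=12, minutes_per_incident=30, fraction_of_users=0.5),
    Risk("database dependency outage", incidents_per_year=2, minutes_per_incident=120, fraction_of_users=1.0),
    Risk("single-zone outage", incidents_per_year=1, minutes_per_incident=60, fraction_of_users=0.33),
]

for name, minutes, share in rank_risks(risks, slo=0.999):
    print(f"{name}: {minutes:.0f} bad minutes/year, {share:.0%} of budget")
```

Swapping a dependency's published SLO for its observed performance only means changing the frequency or duration estimates and re-running the ranking.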

For dependencies, mitigation might mean:
  • Trying to remove it from your critical path
  • Making it more reliable; e.g., running multiple copies and failing over between them
  • Automating manual failover processes
  • Replacing it with a more reliable alternative
  • Sharding it so that the scope of failure is reduced
  • Adding retries
  • Increasing (or decreasing, sometimes it is better to fail fast and retry!) RPC timeouts
  • Adding caching and using stale data instead of live data
  • Adding graceful degradation using partial responses
  • Asking for an SLO that better meets your needs
There may be very little you can do to mitigate unavailability from a critical infrastructure dependency, or it might be prohibitively expensive. Instead, mitigate other sources of error budget burn, freeing up error budget so you can absorb outages from the dependency.

A series of earlier CRE Life Lessons posts (1, 2, 3) discussed consequences and escalations for SLO violations, as a way to balance velocity and risk; an example of a consequence might be to temporarily block new releases when the error budget is spent. If an outage was caused by one of your service's dependencies, should the consequences still apply? After all, it's not your fault, right?!? The answer is "yes"—the SLO is your proxy for your users' happiness, and users don't care whose "fault" it is. If a particular dependency causes frequent violations to your SLO, you need to mitigate the risk from it, or mitigate other risks to free up more error budget. As always, you can be pragmatic about how and when to enforce consequences for SLO violations, but if you're regularly making exceptions, especially for the same cause, that's a sign that you should consider lowering your SLOs, or increasing the time/effort you are putting into improving reliability.

In summary, every non-trivial service has dependencies, probably many of them. When choosing an SLO for your service, don't think about your dependencies and what SLO you can achieve—instead, think about your users, and what level of service they need to be happy. Once you have an SLO, your dependencies represent sources of risk, but they're not the only sources. Analyze all of the sources of risk together to predict whether you'll be able to consistently meet your SLO and prioritize which risks to mitigate.

1 If you're interested, The Calculus of Service Availability has more in-depth discussion about modeling risks from dependencies, and strategies for mitigating them.

Announcing Stackdriver Kubernetes Monitoring: Comprehensive Kubernetes observability from the start

If you use Kubernetes, you know how much easier it makes it to build and deploy container-based applications. But that’s only one part of the challenge: you need to be able to inspect your application and underlying infrastructure to understand complex system interactions and debug failures, bottlenecks and other abnormal behavior—to ensure your application is always available, running fast, and doing what it's supposed to do. Up until now, observing a complex Kubernetes environment has required manually stitching together multiple tools and data coming from many sources, resulting in siloed views of system behavior.

Today, we are excited to announce the beta release of Stackdriver Kubernetes Monitoring, which lets you observe Kubernetes in a comprehensive fashion, simplifying operations for both developers and operators.

Monitor multiple clusters at scale, right out of the box

Stackdriver Kubernetes Monitoring integrates metrics, logs, events, and metadata from your Kubernetes environment and from your Prometheus instrumentation, to help you understand, in real time, your application’s behavior in production, no matter your role and where your Kubernetes deployments run.

As a developer, for instance, this increased observability lets you inspect Kubernetes objects (e.g., clusters, services, workloads, pods, containers) within your application, helping you understand the normal behavior of your application, as well as analyze failures and optimize performance. This helps you focus more on building your app and less on instrumenting and managing your Kubernetes infrastructure.

As a Site Reliability Engineer (SRE), you can easily manage multiple Kubernetes clusters in a single place, regardless of whether they’re running on public or private clouds. Right from the start, you get an overall view of the health of each cluster and can drill down and up the various Kubernetes objects to obtain further details on their state, including viewing key metrics and logs. This helps you proactively monitor your Kubernetes environment to prevent problems and outages, and more effectively troubleshoot issues.

If you are a security engineer, audit data from your clusters is sent to Stackdriver Logging where you can see all of the current and historical data associated with the Kubernetes deployment to help you analyze and prevent security exposures.

Works with open source

Stackdriver Kubernetes Monitoring integrates seamlessly with the leading Kubernetes open-source monitoring solution, Prometheus. Whether you want to ingest third-party application metrics, or your own custom metrics, your Prometheus instrumentation and configuration work within Stackdriver Kubernetes Monitoring with no modification.

At Google, we believe that having an enthusiastic community helps a platform stay open and portable. We are committed to continuing our contributions to the Prometheus community to help users run and observe their Kubernetes workloads in the same way, anywhere they want.

To this end, we will expand our current integration with Prometheus to make sure all the hooks we need for our sidecar exporter are available upstream by the time Stackdriver Kubernetes Monitoring becomes generally available.

We also want to extend a warm welcome to Fabian Reinartz, one of the Prometheus maintainers, who has just joined Google as a Software Engineer. We're excited about his future contributions in this space.

Works great alone, plays better together

Stackdriver Kubernetes Monitoring allows you to get rich Kubernetes observability all in one place. When used together with all the other Stackdriver products, you have a powerful toolset that helps you proactively monitor your Kubernetes workloads to prevent failure, speed up root cause analysis and reduce your mean-time-to-repair (MTTR) when issues occur.

For instance, you can configure alerting policies using Stackdriver's multi-condition alerting system to learn when there are issues that require your attention. Or you can explore various other metrics via our interactive metrics explorer, and pursue root cause hypotheses that may lead you to search for specific logs in Stackdriver Logging or inspect latency data in Stackdriver Trace.

Easy to get started on any cloud or on-prem

Stackdriver Kubernetes Monitoring is pre-integrated with Google Kubernetes Engine, so you can immediately use it on your Kubernetes Engine workloads. It can also be integrated with Kubernetes deployments on other clouds or on-prem infrastructure, so you can access a unified collection of logs, events, and metrics for your application, regardless of where your containers are deployed.


Stackdriver Kubernetes Monitoring gives you:
  • Reliability: Faster time-to-resolution for issues thanks to comprehensive visibility into your Kubernetes environment, including infrastructure, application and service data. 
  • Choice: Ability to work with any cloud, accessing a unified collection of metrics, logs, and events for your application, regardless of where your containers are deployed.
  • A single source of truth: Customized views appropriate for developers, operators, and security engineers, drawing from a single, unified source of truth for all logs, metrics and monitoring data.
Early access customers have used Stackdriver Kubernetes Monitoring to increase visibility into their Kubernetes environments and simplify operations.
"Given the scale of our business we often have to use multiple tools to help manage the complex environment of our infrastructure. Every second is critical for eBay as we aim to easily connect our millions of active buyers with the items they’re looking for. With the early access to Stackdriver Kubernetes Monitoring, we saw the benefits of a unified solution, which helps provide us with faster diagnostics for the eBay applications running on Kubernetes Engine, ultimately providing our customers with better availability and less latency."

-- Christophe Boudet, Staff Devops, eBay

Getting started with Stackdriver Kubernetes Monitoring 

Stackdriver Kubernetes Monitoring Beta is available for testing in Kubernetes Engine alpha clusters today, and will be available in production clusters as soon as Kubernetes 1.10 rolls out to Kubernetes Engine.

Please help us help you improve your Kubernetes operations! Try Stackdriver Kubernetes Monitoring today and let us know how we can make it better and easier for you to manage your Kubernetes applications. Join our user group and send us your feedback at stackdriver-kubernetes-monitoring-users@googlegroups.com.

To learn more, visit https://cloud.google.com/kubernetes-monitoring/

And if you’re at KubeCon in Copenhagen, join us at our booth for a deep-dive demo and discussion.

Apigee named a Leader in the Gartner Magic Quadrant for Full Life Cycle API Management for the third consecutive time

APIs are the de facto standard for building and connecting modern applications. But securely delivering, managing and analyzing APIs, data and services, both inside and outside an organization, is complex. And it’s getting even more challenging as enterprise IT environments grow dependent on combinations of public, private and hybrid cloud infrastructures.

Choosing the right APIs can be critical to a platform’s success. Likewise, full lifecycle API management can be a key ingredient in running a successful API-based program. Tools like Gartner’s Magic Quadrant for Full Life Cycle API Management help enterprises evaluate these platforms so they can find the right one to fit their strategy and planning.

Today, we’re thrilled to share that Gartner has recognized Apigee as a Leader in the 2018 Magic Quadrant for Full Life Cycle API Management. This year, Apigee was not only positioned furthest on Gartner’s “completeness of vision” axis for the third time running, it was also positioned highest in “ability to execute.”

Ticketmaster, a leader in ticket sales and distribution, has used Apigee since 2013. The company uses the Apigee platform to enforce consistent security across its APIs, and to help reach new audiences by making it easier for partners and developers to build upon and integrate with Ticketmaster services.

"Apigee has played a key role in helping Ticketmaster build its API program and bring ‘moments of joy’ to fans everywhere, on any platform," said Ismail Elshareef, Ticketmaster's senior vice president of fan experience and open platform.

We’re excited that APIs and API management have become essential to how enterprises deliver applications in and across clouds, and we’re honored that Apigee continues to be recognized as a leader in its category. Most importantly, we look forward to continuing to help customers innovate and accelerate their businesses as part of Google Cloud.

The Gartner 2018 Magic Quadrant for Full Life Cycle API Management is available at no charge here.

To learn more about Apigee, please visit the Apigee website.

This graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available from Apigee here.
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

Cloud-native architecture with serverless microservices — the Smart Parking story

By Brian Granatir, SmartCloud Engineering Team Lead, Smart Parking

Editor’s note: When it comes to microservices, a lot of developers ask why they would want to manage many services rather than a single, big, monolithic application. Serverless frameworks make doing microservices much easier because they remove a lot of the service management overhead around scaling, updating and reliability. In this first installment of a three-part series, Google Cloud Platform customer Smart Parking gives us their take on event-driven architecture using serverless microservices on GCP. Then read on for parts two and three, where they walk through how they built a high-volume, real-world smart city platform on GCP—with code samples!

Part 1

When "the cloud" first appeared, it was met with skepticism and doubt. “Why would anyone pay for virtual servers?” developers asked. “How do you control your environment?” You can't blame us; we're engineers. We resist change (I still use vim), and believe that proof is always better than a promise. But, eventually we found out that this "cloud thing" made our lives easier. Resistance was futile.

The same resistance to change happened with git (“svn isn't broken”) and docker (“it's just VMs”). Not surprising — for every success story, for every promise of a simpler developer life, there are a hundred failures (Ruby on Rails: shots fired). You can't blame any developer for being skeptical when some random "bloke with a blog" says they found the next great thing.

But here I am, telling you that serverless is the next great thing. Am I just a bloke? Is this a blog? HECK YES! So why should you read on (other than for the jokes, obviously)? Because you might learn a thing or two about serverless computing and how it can be used to solve non-trivial problems.

We developed this enthusiasm for serverless computing building a smart city platform. What is a smart city platform, you ask? Imagine you connect all the devices and events that occur in a city to improve resource efficiency and quality of citizen life. The platform detects a surge in parking events and changes traffic lights to help the flow of cars leaving downtown. It identifies a severe rainstorm and turns on street lights in the middle of the day. Public trash cans alert sanitation when they are full. Nathan Fillion is spotted on 12th street and it swarm-texts local citizens. A smart city is a vast network of distributed devices (IoT City 2000!) streaming data and methods to easily correlate these events and react to them. In other words, it's a hard problem with a massive scale—perfect for serverless computing!

[Image: In-ground vehicle detection sensor]

What the heck is serverless?

But before we go into a lot more depth about the platform, let’s define our terms. In this first article, we give a brief overview of the main concepts used in our smart city platform and how they match up with GCP services. Then, in the second article, we'll dive deeper into the architecture and how each specific challenge was met using various serverless solutions. Finally, in the third article, we'll get extra technical and look at some code snippets and how you can maximize functionality and efficiency. In the meantime, if you have any questions or suggestions, please don't hesitate to leave a comment or email me directly (brian.granatir@smartparking.com).

First up, domain-driven design (DDD). What is domain-driven design? It's a methodology for designing software with an emphasis on expertise and language. In other words, we recognize that engineering, of any kind, is a human endeavour whose success relies largely on proper communication. A tiny miscommunication [wait, we're using inches?] can lead to massive delays or customer dissatisfaction. Developing a domain helps assure that everyone (not just the development team) is using the same terminology.

A quick example: imagine you’re working on a job board. A client calls customer support because a job they just posted never appeared online. The support representative contacts the development team to investigate. Unfortunately, they reach your manager, who promptly tells the team, “Hey! There’s an issue with a job in our system.” But the code base refers to job listings as "postings" and the daily database tasks as "jobs." So naturally, you look at the database "jobs" and discover that last night’s materialization failed. You restart the task and let support know that the issue should be resolved soon. Sadly, the customer’s issue wasn’t addressed, because you never addressed the "postings" error.

Of course, there are more potent examples of when language differences between various aspects of the business can lead to problems. Consider the words "output," "yield," and "spike" for software monitoring a nuclear reactor. Or, consider "sympathy" and "miss" for systems used by Klingons [hint: they don’t have words for both]. Is it too extreme to say domain-driven design could save your life? Ask a Klingon if he’ll miss you!

In some ways, domain-driven design is what this article is doing right now! We're establishing a strong, ubiquitous vocabulary for this series so everyone is on the same page. In part two, we'll apply DDD to our example smart city service.

Next, let's discuss event-driven architecture. Event-driven architecture (EDA) means constructing your system as a series of commands and/or events. A user submits an online form to make a purchase: that's a command. The items in stock are reserved: that's an event. A confirmation is sent to the user: that's an event. The concept is very simple. Everything in our system is either a command or an event. Commands lead to events and events may lead to new commands and so on.
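
The purchase flow just described can be sketched in a few lines of Python. This is only an in-memory illustration of the command/event idea, not code from the Smart Parking platform: the class names, the `SendConfirmation` command and the `run` loop are all hypothetical, and a real system would use a managed message queue instead of a local deque.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class SubmitPurchase:      # command: the user submits the online form
    order_id: str
    item: str


@dataclass(frozen=True)
class StockReserved:       # event: the items in stock were reserved
    order_id: str
    item: str


@dataclass(frozen=True)
class SendConfirmation:    # command (hypothetical): triggered by the event above
    order_id: str


@dataclass(frozen=True)
class ConfirmationSent:    # event: the user has been notified
    order_id: str


# Each handler does one thing: it takes one message and returns the messages it causes.
def reserve_stock(cmd):
    return [StockReserved(cmd.order_id, cmd.item)]


def request_confirmation(event):
    return [SendConfirmation(event.order_id)]


def send_confirmation(cmd):
    return [ConfirmationSent(cmd.order_id)]


HANDLERS = {
    SubmitPurchase: reserve_stock,
    StockReserved: request_confirmation,
    SendConfirmation: send_confirmation,
}


def run(initial):
    """Process messages in order until no handler produces anything new."""
    log, queue = [], deque([initial])
    while queue:
        message = queue.popleft()
        log.append(message)
        handler = HANDLERS.get(type(message))
        if handler is not None:
            queue.extend(handler(message))
    return log


for message in run(SubmitPurchase("order-1", "widget")):
    print(type(message).__name__)
```

Note that nothing in this sketch stores an "order" object. The only state is the sequence of commands and events.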

Of course, defining events at the start of a project requires a good understanding of the domain. This is why it's common to see DDD and EDA together. That said, the elegance of a true event-driven architecture can be difficult to implement. If everything is a command or an event, where are the objects? I got that customer order, but where do I store the "order" and how do I access it? We'll investigate this in much more detail in part two of this series. For now, all you need to understand is that our example smart city project will be defining everything as commands and events!

Now, onto serverless. Serverless computing simply means using existing, auto-scaling cloud services to achieve system behaviours. In other words, I don't manage any servers or docker containers. I don't set up networks or manage operations (ops). I merely provide the serverless solution my recipe and it handles creation of any needed assets and performs the required computational process. A perfect example is Google BigQuery. If you haven't tried it out, please go do that. It's beyond cool (some kids may even say it's "dank," whatever that means). For many of us, it’s our first chance to interact with a nearly infinite global compute service. We're talking about running SQL queries against terabytes of data in seconds! Seriously, if you can't appreciate what BigQuery does, then you better turn in your nerd card right now (mine says "I code in Jawa").

Why does serverless computing matter? It matters because I hate being woken up at night because something broke on production! Because it lets us auto-scale properly (instead of the cheating we all did to save money *cough* docker *cough*). Because it works wonderfully with event-driven architectures and microservices, as we'll see throughout parts 2 & 3 of this series.

Finally, what are microservices? Microservices is a philosophy, a methodology, and a swear word. Basically, it means building our system in the same way we try to write code, where each component does one thing and one thing only. No side effects. Easy to scale. Easy to test. Easier said than done. Where a traditional service may be one database with separate read/write modules, an equivalent microservices architecture may consist of sixteen databases each with individual access management.

Microservices are a lot like eating your vegetables. We all know it sounds right, but doing it consistently is a challenge. In fact, before serverless computing and the miracles of Google's cloud queuing and database services, trying to get microservices 100% right was nearly impossible (especially for a small team on a budget). However, as we'll see throughout this series, serverless computing has made microservices an easy (and affordable) reality. Potatoes are now vegetables!

With these four concepts, we’ve built a serverless sandwich, where:
  • Domain-driven design is the peanut butter, defining the language and context of our project 
  • Event-driven architecture is the jelly, limiting the scope of our domain to events 
  • Microservices are the bread, limiting our architecture to tiny components that react to single event streams
And finally, serverless is having someone else make the sandwich for you (and cutting off the crust), running components on auto-scaling, auto-maintained compute services.

As you may have guessed, we're going to have a microservice that reacts to every command and event in our architecture. Sounds crazy, but as you'll see, it's super simple, incredibly easy to maintain, and cheap. In other words, it's fun. Honestly, remember when coding was fun? Time to recapture that magic!

To repeat, serverless computing is the next big thing! It's the peanut butter and jelly sandwich of software development. It’s an uninterrupted night’s sleep. It's the reason I fell back in love with web services. We hope you’ll come back for part two where we take all these ideas and outline an actual architecture.