
Introducing Jib — build Java Docker images better



Containers are bringing Java developers closer than ever to a "write once, run anywhere" workflow, but containerizing a Java application is no simple task: You have to write a Dockerfile, run a Docker daemon as root, wait for builds to complete, and finally push the image to a remote registry. Not all Java developers are container experts; what happened to just building a JAR?

To address this challenge, we're excited to announce Jib, an open-source Java containerizer from Google that lets Java developers build containers using the Java tools they know. Jib is a fast and simple container image builder that handles all the steps of packaging your application into a container image. It does not require you to write a Dockerfile or have docker installed, and it is directly integrated into Maven and Gradle—just add the plugin to your build and you'll have your Java application containerized in no time.

Docker build flow:

Jib build flow:


How Jib makes development better:


Jib takes advantage of layering in Docker images and integrates with your build system to optimize Java container image builds in the following ways:
  1. Simple - Jib is implemented in Java and runs as part of your Maven or Gradle build. You do not need to maintain a Dockerfile, run a Docker daemon, or even worry about creating a fat JAR with all its dependencies. Since Jib tightly integrates with your Java build, it has access to all the necessary information to package your application. Any variations in your Java build are automatically picked up during subsequent container builds.
  2. Fast - Jib takes advantage of image layering and registry caching to achieve fast, incremental builds. It reads your build config, organizes your application into distinct layers (dependencies, resources, classes) and only rebuilds and pushes the layers that have changed. When iterating quickly on a project, Jib can save valuable time on each build by only pushing your changed layers to the registry instead of your whole application.
  3. Reproducible - Jib supports building container images declaratively from your Maven and Gradle build metadata, and as such can be configured to create reproducible build images as long as your inputs remain the same.

How to use Jib to containerize your application

Jib is available as plugins for Maven and Gradle and requires minimal configuration. Simply add the plugin to your build definition and configure the target image. If you are building to a private registry, make sure to configure Jib with credentials for your registry. The easiest way to do this is to use credential helpers like docker-credential-gcr. Jib also provides additional rules for building an image to a Docker daemon if you need it.

Jib on Maven
<plugin>
  <groupId>com.google.cloud.tools</groupId>
  <artifactId>jib-maven-plugin</artifactId>
  <version>0.9.0</version>
  <configuration>
    <to>
      <image>gcr.io/my-project/image-built-with-jib</image>
    </to>
  </configuration>
</plugin>
# Builds to a container image registry.
$ mvn compile jib:build
# Builds to a Docker daemon.
$ mvn compile jib:dockerBuild
Jib on Gradle
plugins {
  id 'com.google.cloud.tools.jib' version '0.9.0'
}
jib.to.image = 'gcr.io/my-project/image-built-with-jib'
# Builds to a container image registry.
$ gradle jib
# Builds to a Docker daemon.
$ gradle jibDockerBuild
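For example, if you are pushing to Google Container Registry, one lightweight option (a sketch, assuming the Cloud SDK is installed and you are already authenticated with gcloud) is to install the docker-credential-gcr helper, which Jib can then use to authenticate to gcr.io:
# Installs the Google Container Registry credential helper (requires the Cloud SDK).
$ gcloud components install docker-credential-gcr
# Registers the helper so builds can authenticate to gcr.io registries.
$ docker-credential-gcr configure-docker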

We want everyone to use Jib to simplify and accelerate their Java development. Jib works with most cloud providers; try it out and let us know what you think at github.com/GoogleContainerTools/jib.

Five can’t-miss application development sessions at Google Cloud Next ‘18

Google Cloud Next ‘18 will be a developer’s paradise, with bootcamps, hands-on labs, and yes, breakout sessions—more than 60 dedicated to app dev in some form or another. And that’s before we get to the Spotlight sessions explaining new product launches! We polled developer advocates and product managers from across Google Cloud, and here are their picks for the sessions you can’t afford to miss.

1. From Zero to Production: Build a Production-Ready Deployment Pipeline for Your Next App

Scott Feinberg, Customer Engineer, Google Cloud

Want to start deploying to Google Cloud Platform (GCP) but aren't sure how to start? In this session, you'll take an app with multiple process types, containerize it, and build a deployment pipeline with Container Builder to test and deploy your code to a Kubernetes Engine cluster.

Register for the session here.

2. Enterprise-Grade Mobile Apps with Firebase

Michael McDonald, Product Manager and Jonathan Shriver-Blake, Product Manager, Google Firebase

Firebase helps mobile development teams build better apps, improve app quality, and grow their business. But before you can use it in your enterprise, you’ll have to answer a number of questions: Will it scale in production? Is it reliable, and can your team monitor it? How do you control who has access to production data? What will the lawyers say? And how about compliance and GDPR? This session will show you the answers to these questions and pave the way to use Firebase in your enterprise.

Click here to reserve your spot.

3. Migrating to Cloud Spanner

Niel Markwick, Solutions Architect and Sami Zuhuruddin, Staff Solutions Architect, Google Cloud

When migrating an existing database to Cloud Spanner, an essential step is importing the existing data. This session describes the steps required to migrate the data and any pitfalls that need to be dealt with during the process. We'll cover what it looks like to transition to Cloud Spanner, including schema migration, data movement, cutover, and application changes. To make it real, we'll be looking at migrating from two popular systems: one NoSQL and the other SQL.

Find more details about the session here.

4. Serverless Compute on Google Cloud: What's New

Myles Borins, Developer Advocate and Jason Polites, Product Manager, Google

Join us to learn what’s new in serverless compute on GCP. We will share the latest developments in App Engine and Cloud Functions and show you how you can benefit from new feature releases. You will also get a sneak peek at what’s coming next.

Secure your spot today.

5. Accelerating Your Kubernetes Development with Kubernetes Applications

Konrad Delong, Senior Software Engineer; David Eustis, Senior Staff Software Engineer; and Kenneth Owens, Software Engineer, Google

Kubernetes applications provide a new, powerful abstraction for you to compose and re-use application building blocks from a variety of sources. In this talk, we’ll show you how to accelerate your development process by taking advantage of Kubernetes applications. We’ll walk you through creating these applications and deploying third-party, commercial Kubernetes applications from the Google Cloud Marketplace.

Click here to register for this session.

And if you haven’t already registered for Next, don’t delay! Everyone who attends will receive $500 in GCP credits. Imagine the possibilities!

Introducing Endpoint Verification: visibility into the desktops accessing your enterprise applications



While corporate devices are the key to employee productivity, they can also be the weak link when it comes to application and data security. Today we are introducing Endpoint Verification, which gives admins an overview of the security posture of their laptop and desktop devices. An inventory of the computers employees use to access corporate data gives the enterprise valuable information it can use to maintain security. Available to all Google Cloud Platform (GCP), Cloud Identity, G Suite Business, and G Suite Enterprise customers, Endpoint Verification consists of a Chrome extension and native app and is available for Chrome OS, macOS, and Windows devices.
Endpoint Verification is available as a Chrome extension

With the proliferation of multiple platforms and bring-your-own-device (BYOD) policies in the enterprise, administrators find full mobile device management (MDM) solutions difficult to deploy and maintain. Endpoint Verification offers a lightweight, easy-to-deploy solution for desktop device reporting for GCP, Cloud Identity, and G Suite customers.

With Endpoint Verification, enterprises get two key benefits immediately. First, you can now build an inventory of the devices within the enterprise that access corporate data. And second, admins get access to device information including screen lock status, disk encryption status, and OS version.

For information on how to deploy Endpoint Verification, please visit the help center. For organizations that would like to try this out, a free trial of Cloud Identity is available here.

Last month today: GCP in June

In June, we had a lot to discuss about getting the most out of the cloud for your business, from speeding up web traffic to running fully managed apps easily. Here’s a quick look at some of the highlights from Google Cloud Platform (GCP) news this month.

What caught your attention this month

Some of the most-read stories this month reflected new technology developments or integrations that will be useful for developers and engineers.
  • You can now deploy your Node.js app to the Google App Engine standard environment—and based on readership, many of you are excited about this. Node.js works easily on App Engine, without any language, module or API restrictions. You’ll get very quick deployment times and a fully managed experience once you’ve deployed those apps, just as with other apps on the fully managed App Engine.
  • QUIC is a transport protocol, optimized for HTTPS, that makes web traffic run faster. The protocol itself isn’t new, but last month we announced QUIC support for our HTTPS load balancers. Network performance is a huge part of a successful public cloud operation, so this new support could make a big impact on web page load times for your cloud services. Enabling QUIC means your connections can be established faster, which is especially useful for latency-prone connections, and clients that don’t yet support QUIC will seamlessly continue to use HTTPS (see the example command after this list).
  • If you’re a Kubernetes fan, you may have already explored the new kubemci command-line interface (CLI). It lets you configure ingress for multi-cluster Kubernetes Engine environments, using Cloud Load Balancer. It’s also the first step in a long-term solution that will consist of a multi-cluster ingress system controlled via kubectl CLI or Kubernetes API calls.
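On the QUIC item above: enabling it for an existing HTTPS load balancer is a one-flag change. A minimal sketch, assuming a target HTTPS proxy hypothetically named my-https-proxy:
# Enables QUIC negotiation on an existing target HTTPS proxy (proxy name is illustrative).
$ gcloud compute target-https-proxies update my-https-proxy --quic-override=ENABLE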

Hot topics

You can now run your GCP workloads in Finland to improve availability and reduce your latency in the Nordics, and we announced that the Los Angeles region will open next month.

We also added some new storage tools to your arsenal. We’re adding Cloud Filestore as a GCP storage option so you can run enterprise applications that need a file system interface and shared file system for data. It’s fully managed and offers high performance for applications that need low latency and high throughput. For those of you supporting and running creative industry applications on GCP infrastructure, Cloud Filestore works great for render farms, website hosting and content management systems.

In addition, the Transfer Appliance became generally available in June, allowing a type of cloud data migration that will work well if you’ve got more than 20TB of data to upload to GCP, or that would take more than a week to upload. In early use, Transfer Appliance customers have gotten quick starts on analytics projects by moving test data to GCP, along with moving backup data and some or all of a data center to GCP.

And in the “Cloud powers some very cool projects” category, take a look at how the new Dragon Ball Legends game creator built the backend on GCP. Bandai Namco Entertainment knew that players of the latest addition to their Dragon Ball Z franchise would want to play against one another in real-time, with players around the globe. They turned to GCP for the scalability, global reach and real-time analytics they needed to make that possible.

Behind the compute curtain

This news of sole-tenant nodes for Google Compute Engine will come in handy for those of you at companies that need dedicated cloud servers. With this option, it’s possible to launch new VM instances as usual, but on server capacity dedicated to you. This choice is nice for industries with strict compliance and regulatory rules around data, and for getting higher utilization from VM instances along with instance placement, done either manually or by Compute Engine.

Building applications on GCP involves some upfront choices for app developers: Which compute offering will you pick, and what language will you use? Whether you’re a fan of VMs, containers, App Engine or Cloud Functions, you’ll find in this post some excellent, concrete examples of the time and effort involved in building a “Hello, World” app on each of GCP’s four compute platforms.

That’s a wrap for June. This month brings the Next ‘18 conference, July 24-26. Join us and thousands of other IT practitioners in San Francisco to learn all you need to know about building a modern cloud infrastructure. Till then, build away!

Kubernetes 1.11: a look from inside Google



Congratulations to everyone involved in the recent Kubernetes 1.11 release. Now that the core has been stabilized, we here at Google have been focusing our upstream work on increasing Kubernetes’ pluggability, i.e., moving more pieces out into other repositories. As the project has matured, adding a plugin no longer means "sending Tim Hockin a pull request," but instead means creating proper, well-defined interfaces with names like CNI, CRI and CSI. In fact, this maturity and extensibility has been one of the things that helps us make Google Kubernetes Engine an enterprise-ready platform. Back in March, we gave you a look at what was new in Kubernetes 1.10. Now, with the release of 1.11, let’s take a look at the core Kubernetes work that Google is driving, as well as some of the innovation we've built on Kubernetes’ foundations in the last three months.

New features in 1.11

Priority and preemption
Pod priority and preemption is one of the main features of our internal scheduling system that lets us achieve high resource utilization in our data centers. We wrote about that key use case when we introduced it in Alpha in Kubernetes 1.9, and since then, we’ve added improved scheduling performance and better support for critical system pods. Now, we're pleased to move it to Beta in this release, meaning it’s enabled by default in Kubernetes Engine clusters that run 1.11. This is a feature that many users who run larger clusters have been waiting for!
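As a rough illustration (the class name and value here are just examples), a cluster operator defines a PriorityClass with the beta scheduling API, and pods opt in to it by name:
# Creates a PriorityClass using the scheduling API that is beta in Kubernetes 1.11.
$ kubectl apply -f - <<EOF
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "For latency-sensitive workloads; may preempt lower-priority pods."
EOF
# Pods opt in by setting priorityClassName: high-priority in their pod spec.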

Changes to CRDs
Custom Resource Definitions (CRDs) are one of the most popular extension mechanisms for Kubernetes, and new features in 1.11 make them even more powerful. CRDs are used for a broad array of Kubernetes extensions, for example to enable the use of Spark or Functions natively through the Kubernetes API.

Kubernetes objects have a schema version (e.g. v1beta1 or v1), but we only ever store one version in the etcd database. When you query an object at a particular version, a server-side conversion is done to convert the object to match the schema of the version you request.

Previously, CRD authors had to delete and recreate resources to move them between different versions. In 1.11, you can now define multiple versions for your own resources. The next step will be to enable server-side conversion for CRD, to allow for schema changes like renaming fields, without breaking existing clients.
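To sketch what that looks like (the group and kind here are hypothetical), a CRD can now list several versions, with exactly one marked as the storage version; in 1.11, all served versions must still share the same schema:
# Registers a CRD that serves two versions of the same resource.
$ kubectl apply -f - <<EOF
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: crontabs.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
  versions:
  - name: v1beta1
    served: true
    storage: false
  - name: v1
    served: true
    storage: true
EOF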

Cloud Provider plugins
Google continues to invest in the long-term sustainability and multi-cloud portability of core Kubernetes. The Cloud Provider interface allows infrastructure providers to deliver a "batteries-included" experience for user workloads on their platform, powering common services like dynamic provisioning and management of storage and external load balancing for Services.

This code is currently compiled into the core Kubernetes binaries. Google is leading a long-running effort to extract this functionality into provider-specific repositories, in order to reduce the scope of the Kubernetes core. This will also allow providers to deliver enhancements and fixes to users more quickly than Kubernetes’ three-month release cadence. As part of this effort, we’re excited to announce the creation of SIG Cloud Provider to provide technical oversight and governance for this work.

New features not in 1.11

That's not a headline you normally see, right?

One thing that is not in 1.11 — not even a bit of it — is Server-side Apply, a feature which moves the logic for kubectl apply from the client to server, making the expected behavior clearer, and allowing more clients to take advantage of server-side processing without shelling out to kubectl.

Normally, a feature like this would be committed to the project as it was built. But if a release is due and the feature isn't ready, a large amount of effort has to go into reverting it. Instead, Google has been leading the effort to introduce feature branches in Kubernetes, which let us work on long-running features in parallel to the main codebase. This lets us avoid last-minute scrambles to adjust for surprises, and is an example of how we are working to ensure the stability of the Kubernetes project.

Work on server-side apply is happening in the open in its feature branch, and we look forward to welcoming it into Kubernetes when it's ready — and not a moment before.

Kubernetes ecosystem work
Our work with Kubernetes doesn't stop at releasing core binaries every three months. Some of the work we are most excited about is in the form of extensions we've released since the last Kubernetes release:

Kustomize
We've thought a lot about how to declaratively manage application configuration. A common pattern that we saw was the use of templating solutions such as Helm (based on Google Cloud's Deployment Manager), which requires a user to learn a different configuration language than what the API server returns when you query it. A templating approach also means that if you download a YAML example, you have to turn it into a template before you can use it in your environment.

With kustomize, we're introducing a new approach to application definition. Kustomize allows you to apply overlays to existing YAML configurations, so you can customize a forked repository with your local changes, or define separate 'staging' and 'production' variants with different configurations and replica counts.

Kustomize is well suited for a GitOps-style workflow, where there's a common base configuration that is tweaked in various directions with overlays to create different variants. The base and overlays can be managed by separate teams in different repositories.
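As a rough sketch of that workflow (directory names are invented, and some kustomization.yaml field names have changed in later kustomize releases), a base and a production overlay might look like this:
$ cat base/kustomization.yaml
resources:
- deployment.yaml
- service.yaml

$ cat overlays/production/kustomization.yaml
bases:
- ../../base
namePrefix: prod-
commonLabels:
  variant: production

# Renders the production variant and applies it to the cluster.
$ kustomize build overlays/production | kubectl apply -f -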

Application API
Applications are made up of many services and resources, but the whole is more than the sum of its parts. Once those parts are created, there is no well-defined way to tell Kubernetes which of them belong to a given application. We want cluster users to be able to think in terms of their applications, and to allow tools and UIs to define, update and display an application-centric view of your cluster.

The new Application API provides a way to aggregate Kubernetes components (e.g. Services, Deployments, StatefulSets, Ingresses, CRDs), and manage them as a group.

We have had contributions from friends at Samsung, Bitnami, Heptio, Red Hat and more, and we are looking for more contributions and feedback to ensure that the project adds value across the community.

The Application API is currently in Alpha. We hope to promote it to Beta in the next few weeks, and you'll hear more about it from us then.

Looking forward to Kubernetes Engine

If you'd like to get access to Kubernetes 1.11 on Kubernetes Engine ahead of general availability, please complete this form.

And if you liked reading this post, you'll love the Kubernetes Podcast from Google, which I co-host with Adam Glick. Every Tuesday we take a look at the week’s news and talk with Googlers or members of the wider Kubernetes community. So far we've spoken about product launches, processes and community, and this week we talk to the Kubernetes 1.11 release leads. Subscribe now!

New GitHub repo: Using Firebase to add cloud-based features to games built on Unity



A while back, a group of us Google Cloud Platform Developer Programs Engineers teamed up with gaming fans in Firebase Engineering to work on an interesting project. We all love games, gamers, and game developers, and we wanted to support those developers with solutions that accomplish common tasks so they can focus more on what they do best: making great games.

The result was Firebase Unity Solutions. It’s an open-source GitHub repository with sample projects and scripts. These projects use Firebase tools and services to help you add cloud-based features to the games you’re building on Unity.

Each feature will include all the required scripts, a demo scene, any custom editors to help you better understand and use the provided assets, and a tutorial to use as a step-by-step guide for incorporating the feature into your game.

The only requirements are a Unity project with the .NET 2.0 API level enabled, and a project created with the Firebase Console.

Introducing Firebase Leaderboard


Our debut project is the Firebase_Leaderboard, a set of scripts that utilize Firebase Realtime Database to create and manage a cross-platform high score leaderboard. With the LeaderboardController MonoBehaviour, you can retrieve any number of unique users’ top scores from any time frame. Want the top 5 scores from the last 24 hours? Done. How about the top 100 from last week? You got it.

Once a connection to Firebase is established, scores are retrieved automatically, including any new scores that come in while the controller is enabled.

If any of those parameters are modified (the number of scores to retrieve, or the start or end date), the scores are automatically refreshed. The content is always up-to-date!

private void Start() {
    // Find the scene's LeaderboardController and subscribe to its events.
    this.leaderboard = FindObjectOfType<LeaderboardController>();
    leaderboard.FirebaseInitialized += OnInitialized;
    leaderboard.TopScoresUpdated += UpdateScoreDisplay;
    leaderboard.UserScoreUpdated += UpdateUserScoreDisplay;
    leaderboard.ScoreAdded += ScoreAdded;

    MessageText.text = "Connecting to Leaderboard...";
}
With the same component, you can add new scores for current users as well, meaning a single script handles both read and write operations on the top score data.

public void AddScore(string userId, int score) {
    leaderboard.AddScore(userId, score);
}
For step-by-step instructions on incorporating this cross-platform leaderboard into your Unity game using Firebase Realtime Database, follow the instructions here. Or check out the Demo Scene to see a version of the leaderboard in action!

We want to hear from you

We have ideas for what features to add to this repository moving forward, but we want to hear from you, too! What game feature would you love to see implemented in Unity using Firebase tools? What cloud-based functionality would you like to be able to drop directly into your game? And how can we improve the Leaderboard, or other solutions as they are added? You can comment below, create feature requests and file bugs on the GitHub repo, or join the discussion in this Google Group.

Let’s make great games together!

Understanding error budget overspend – part one – CRE life lessons



In previous CRE Life Lessons blog posts, the Google Customer Reliability Engineering (CRE) team has spent a lot of time talking about service level objectives (SLOs), which measure whether your service is meeting its reliability targets from the point of view of its end users. Your SLO lets you specify how much downtime your service can have in a given period—for example, 43 minutes every 30 days for a service that needs to be available 99.9% of the time. This downtime allowance is your error budget. Like a household budget, it’s OK to spend this error budget over those 30 days, as long as you don’t spend more than that.
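As a quick back-of-the-envelope check, that 43-minute figure is simply the 0.1% of a 30-day window that the SLO allows you to miss:
# 0.1% of a 30-day window, expressed in minutes.
$ echo "30 * 24 * 60 * (1 - 0.999)" | bc -l
43.200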

If you do run out of your error budget, either by spending a bit too much each day, or by having a major outage that blows it all at once, that tells you that your service’s users are suffering too much and it’s time to give them a break. How do you do that? Here are a few questions to consider to see if you need to recalibrate your error budget.

Where are you spending your error budget?

Your SLOs will be target values for corresponding service level indicators (SLIs), which are the measurements of the critical parts of the end-user experience. One SLI for the 99.9% available example system above might be “the percentage of HTTP responses which are successful (2xx), out of all 2xx and 5xx HTTP responses.” You calculate your error budget spend as the percentage of the measurement period where your service fails to reach all of its SLO targets; depending on the granularity and accuracy of your SLI measurement, this might be done on a per-minute, per-hour or even per-day basis.

When you analyze your error budget spend day-by-day, you should try to attribute the main causes of error budget spend over the measurement period:
  • Do most of your errors happen when you’re doing binary releases? That implies that you’re not going to be able to keep within budget unless you do something to make releases less frequent, less error-prone or lower-impact when there is an error.
  • Are you seeing steady error spend coming from intermittent application failure, which adds up to the majority of your budget? That’s telling you that you’ve got a fundamental failure in your application. It’s a strong signal that you need to drill down in your logs to find the troublesome queries, and that you should expect to dedicate some of your engineers to identify the root causes and either address them directly or plan to fix them in your next project planning cycle.
  • Are large chunks of your error budget getting spent by major application failures, where most of your service goes down for many minutes due to configuration pushes, excessive load or queries-of-death? You need to run effective postmortems to identify the root causes and mitigate them. You will need to redirect some of your development engineering effort to address the top action items from those postmortems—so feature development and releases will naturally happen more slowly. (More on this in another post.)
  • Is the bulk of your spend coming from a dependency outside your control, such as a critical backend or your compute platform? You’ll need to address the dependency or platform owner directly, showing them your SLI metrics and negotiating about how they can make their service more reliable—or how you can be more resilient to the expected failure modes.
For each of these cases, you have an objective measurement of whether the problem has been sufficiently addressed: you will expect your SLIs to stay high in circumstances where previously they plummeted.

Are you measuring the right signal?

Something else you should consider: Did the outage reflect real user pain? If you have a strong indicator that users weren’t concerned by a major outage that spent a chunk of your error budget, then you may not have to change your development practices or architecture, but you still have something to fix. Either you should determine a new, lower target level for your SLO, or you should find a different SLI that better represents the user experience.

Can your users tolerate a slightly worse experience?

Suppose you’re trying to run your service at a 99.9% availability level, with the corresponding 43-minute-per-month error budget, but you’re consistently failing to meet that; you’re spending 50-60 minutes per month. How much does that actually matter?

You probably have business intelligence channels for measuring customer happiness in terms of time spent on your site, purchase rate, support tickets raised and other fairly direct measurements of user happiness. Evaluate those statistics against your SLIs: Are your budget overspend periods correlated with less user happiness, and if so, what’s the correlation function? If a 50% error budget overspend corresponds to a 1% decrease in customer revenue, then you may feel that you can adjust your SLO target and aim for a 99.5% availability level, rather than spend a lot of engineering effort trying to raise your availability to the original target.

What is important in this case is to have, and document, the data used to determine the SLO target. You don’t want to fall into the trap of increasing your error budget by 50% each period because “users don’t really care”—you need to articulate the tradeoff in user happiness/spend vs. reliability in your SLO definition. An SLO specification shouldn’t just contain numbers and metric names. It should also reference the logic and data used to determine the SLO target.

When your users’ experience isn’t definitive

It may be true that the customer is always right— but what if your service’s users are part of your company? In some cases, the overall business decision may be that continuing to build and release the software is in the best interest of the company as a whole, even if you’re consistently going over budget. The error budget spend may cause an inconvenience to employees, but failing to release new versions of the software would have a significant cost to the company that outweighs user inconvenience.

This can occur when there's a disconnect between what the users of the software are perceived to need (for example, the 99.9% availability target of this example service) and what the executives who pay for the development of the software think these users should tolerate in the name of greater velocity.

Now that we understand what an error budget is telling us, in part two of this post we will look at how best to keep a positive balance.

Interested to learn more about site reliability engineering (SRE) in practice? We’ll be discussing how to apply SRE principles to deploy and manage services at Next ‘18 in July. Join us!


Good housekeeping for error budgets – part two – CRE life lessons



In part one of this CRE Life Lessons post, we talked about the error budget for a service and what it tells you about how close your service is to breaching its reliability targets, its service level objectives (SLOs). Once you’ve done some digging to understand why you may be consistently overspending your error budget, it’s time to fix the root causes of the problem.

Paying off your error budget debt

Those of us who have held a significant balance on a credit card are familiar with the large bite it can take out of a monthly household budget. Good housekeeping practice means that we should be looking to pay down as much of that debt as possible in order to shrink the monthly charge. Error budgets work the same way.

Once your error budget spend analysis identifies the major contributors to the spending rate, you should be prepared to redirect your developers’ efforts from new features to addressing those causes of spend. This might mean an improved QA environment or test set to catch more release errors before they hit production, or better rollout automation and monitoring to detect and roll back bad releases more quickly.

The effect of this approach is likely to be that you make less frequent releases, or each release has fewer changes and hence is less likely to contain an error-budget-impacting problem. You’re slowing down release velocity temporarily in order to allow safer releasing at the original velocity in future.

Looking at downstream spend

Another issue to consider is: What if the error budget overspend wasn’t the developers’ fault? If your data center or cloud platform has a hardware outage, there’s not much the developers can do about it. Sure, your end users don’t care why the service broke, and you don’t want to make their lives worse, but it seems harsh to ding your developers for someone else’s failure. This should surface in your analysis of error budget spend, as described above.

What next? You may need to talk to the owners of that platform about their historical (measured) reliability and how it squares with you trying to run your service at your target SLO. It may be that changes are needed on both sides: You change your system to be able to detect and tolerate certain failures from them, and they improve detection and resolution time of the failure cases that impact you.

Often, a platform is not going to change significantly, so you have to decide how to account for that error spend in future. You may decide that it’s significant enough that you need to increase your resilience to it, e.g., by implementing (and exercising!) the option to fail your service automatically out of an affected cloud region over to an unaffected region. (See our “Defining SLOs for services with dependencies” blog post, which dealt with this problem in depth.)

When your releases are the problem

It could be, however, that your analysis leads you to the conclusion that software releases are a major source of your error budget spend. Certainly, our experience at Google is that binary rollouts are one of the top sources of outages; many a postmortem starts “We rolled out a new release of the software, which we thought did <X>, which our users would love, but in fact it did <Y>, which caused users to see errors in their browser/be unable to view photos of their cat/receive 100 email bounces a minute.”

The canonical response to a series of bad releases that overspend the error budget is to impose a freeze on the release of new features. This can be seen as a last-resort action; it acknowledges that the existing efforts to pay down debt have not delivered sufficient reliability improvement, so lowering the rate of change is required instead to protect the user experience. A freeze of this nature can also give development teams the space and direction to refocus their attention away from features and onto reliability improvements. However, it’s a drastic step to take.

Other ways you can avoid freezing include:
  • Make an explicitly agreed-upon adjustment to the feature vs. reliability work balance. For example, your company normally does two-week sprints, where 95% of the work is feature-driven and 5% is postmortem action items and other reliability work. You agree that while your service is out of SLO, the sprints will instead be 50/50.
  • Overprovision your service. For instance, pay more money to replicate to another cloud zone and/or region, or run more replicas of your service to handle higher traffic loads. This is only effective if you have determined that this approach will help mitigate error budget spend.
  • Declare a reliability incident. Appoint a specific person to analyze the postmortem and error budget spend and come up with recommendations. It’s important that the business has pre-committed to prioritizing those recommendations.

Winter is coming

If you really have to impose a new features freeze, how long should it last? Generally, it should last until you have removed the overspend in your error budget, and have confidence it will not recur. We’ve seen two principal methods of error budget calculation: fixed intervals (say, each calendar month) and rolling intervals (the last N days).

If you operate a fixed interval for your error budget calculation, your reaction to an error budget overspend depends on when it happens. If it happens on day 1, you spend the whole month frozen; if it’s on day 28, you may not actually need to stop releasing because your next release may be in the next month, when the error budget is reset. Unless your customer is also sensitive to outages on a calendar month basis, this doesn’t seem to be a good practice to optimize your customers’ experience.

For a rolling 30-day error budget measurement period, your 99.9% available service regains the error budget spent on day N-30, so if your budget is 20 minutes overspent, you need to wait until those 20 minutes of debt have dropped out of the window. So if you spent 15 minutes of your budget on day N-29 and five minutes on day N-28, you’d need to wait two more days to get back to a positive balance, assuming no further outages. In practice, you’d probably wait until you accumulate a buffer of 20% of your error budget so you are resilient to minor unexpected spends.

Following this guidance, if you have a major outage that spends your entire month’s budget in one day, then you’d be frozen for an entire month. In practice, this may not be acceptable. At the very least, though, you should be drastically down-scaling your release velocity in order to have more engineering time to devote to fixing the root causes (see “Paying off your error budget debt” above). There are other approaches, too: Check out the discussion about blocking releases in an earlier episode of CRE Life Lessons, where we analyzed an example escalation policy.

As you can see, the rolling period for error budget measurement is less prone to a varying reaction depending on the particular date of an outage. We recommend that you adopt this approach if you can, though admittedly it can be challenging to accumulate this kind of data in monitoring tools currently.

The long-term costs of freezes

Freezing the release of new features isn’t free of cost. In a worst-case scenario, if your developers are continuing new feature development but not releasing those features to users, the changes will build up, and when you finally resume releases it is almost inevitable that you’re going to see a series of broken releases. We’ve seen this happen in practice: if we impose a freeze on a service over an event like Black Friday or New Year’s, we expect that the week following the freeze will be unusually busy with service failures as all the backed-up changes reach users. To avoid this, it’s important to re-emphasize to teams affected by the freeze that it is intended to provide space to focus on reliability development, not feature development.

Sometimes it’s not possible to freeze all releases. Your company may have a major event coming up, such as a conference, and so there’s a compelling need to push certain new features into production no matter what the recent experience of its users. One process you could adopt in this case is the concept of a silver bullet: The product management team has a (very limited) right to override a release freeze to deploy a critical feature. To make this approach work well, that right needs to be expensive to exercise and limited in frequency: The spend of a silver bullet should be regarded as a failure, and require a postmortem to analyze how it came about and how to mitigate the risk of it happening again.

Using the error budget to your (and your users’) advantage

An error budget is a crucial concept when you’re taking a principled approach to service reliability. Like a household budget, it’s there for you (the service owner) to spend, and it's important for the service stakeholders to agree on what should happen when you overspend it ahead of doing so. If you find you’ve overspent, a feature freeze can be an effective tool to prioritize development time toward reliability improvements. But remember that reflexively freezing your releases when you blow through your error budget isn’t always the appropriate response. Consider where your budget is being spent, how to reduce the major sources of spend and whether some loosening of the purse strings is in order. The most important principle: Do it based on data!

Interested to learn more about site reliability engineering (SRE) in practice? We’ll be discussing how to apply SRE principles to deploy and manage services at Next ‘18 in July. Join us!


Why we believe in an open cloud



Open clouds matter more now than ever. While most companies today use a single public cloud provider in addition to their on-premises environment, research shows that most companies will likely adopt multiple public and private clouds in the coming years. In fact, according to a 2018 RightScale study, 81 percent of enterprises with 1,000 or more employees have a multi-cloud strategy, and if you consider SaaS, most organizations are doing multi-cloud already.

Open clouds let customers freely choose which combination of services and providers will best meet their needs over time. Open clouds let customers orchestrate their infrastructure effectively across hybrid-cloud environments.

We believe in three principles for an open cloud:
  1. Open is about the power to pick up an app and move it—to and from on-premises, our cloud, or another cloud—at any time.
  2. Open-source software permits a richness of thought and continuous feedback loop with users.
  3. Open APIs preserve everyone’s ability to build on each other’s work.

1. Open is about the power to pick up an app and move it

An open cloud is grounded in a belief that being tied to a particular cloud shouldn’t get in the way of achieving your goals. An open cloud embraces the idea that the power to deliver your apps to different clouds while using a common development and operations approach will help you meet whatever your priority is at any given time—whether that’s making the most of skills shared widely across your teams or rapidly accelerating innovation. Open source is an enabler of open clouds because open source in the cloud preserves your control over where you deploy your IT investments. For example, customers are using Kubernetes to manage containers and TensorFlow to build machine learning models on-premises and on multiple clouds.

2. Open-source software permits a richness of thought and continuous feedback loop with users

Through the continuous feedback loop with users, open source software (OSS) results in better software, faster, and requires substantial time and investment on the part of the people and companies leading open source projects. Here are examples of Google’s commitment to OSS and the varying levels of work required:
  • OSS such as Android has an open code base, and development is the sole responsibility of one organization.
  • OSS with community-driven changes, such as TensorFlow, involves coordination between many companies and individuals.
  • OSS with community-driven strategy, for example collaboration with the Linux Foundation and the Kubernetes community, involves collaborative decision-making and accepting consensus over control.
Open source is so important to Google that we call it out twice in our corporate philosophies, and we encourage employees, and in fact all developers, to engage with open source.

Using BigQuery to analyze GHarchive.org data, we found that in 2017, over 5,500 Googlers submitted code to nearly 26,000 repositories, created over 215,000 pull requests, and engaged with countless communities through almost 450,000 comments. A comparative analysis of Google’s contribution to open source provides a useful relative position of the leading companies in open source based on normalized data.

Googlers are active contributors to popular projects you may have heard of including Linux, LLVM, Samba, and Git.

Google regularly open-sources internal projects


3. Open APIs preserve everyone’s ability to build on each other’s work

Open APIs preserve everyone’s ability to build on each other’s work, improving software iteratively and collaboratively. Open APIs empower companies and individual developers to change service providers at will. Peer-reviewed research shows that open APIs drive faster innovation across the industry and in any given ecosystem. Open APIs depend on the right to reuse established APIs by creating independent-yet-compatible implementations. Google is committed to supporting open APIs through our membership in the Open API Initiative, involvement in the OpenAPI specification, support of gRPC, Cloud Bigtable’s compatibility with the HBase API, Cloud Spanner and BigQuery’s compatibility with SQL:2011 (with extensions), and Cloud Storage’s compatibility with shared APIs.

Build an open cloud with us

If you believe in an open cloud like we do, we’d love your participation. You can help by contributing to and using open source libraries, and asking your infrastructure and cloud vendors what they’re doing to keep workloads free from lock-in. We believe open ecosystems grow the fastest and are more resilient and adaptable in the face of change. Like you, we’re in it for the long-term.



It’s worth noting that not all of Google’s products will be open in every way at every stage of their life cycle. Openness is less an absolute and more a mindset for conducting business in general. You can, however, expect Google Cloud to continue investing in openness across our products over time, to contribute to open source projects, and to open source some of our internal projects.

If you believe open clouds are an important part of making this multi-cloud world a place in which everyone can thrive, we encourage you to check out our new open cloud website where we offer more detailed definitions and examples of the terms, concepts, and ideas we’ve discussed here: cloud.google.com/opencloud.

Announcing MongoDB Atlas free tier on GCP



Earlier this year, in response to strong customer demand, we announced that we were expanding region support for MongoDB Atlas. The MongoDB NoSQL database is hugely popular, and the MongoDB Atlas cloud version makes it easy to manage on Google Cloud Platform (GCP). We heard great feedback from users, so we’re further lowering the barrier to get started on MongoDB Atlas and GCP.

We’re pleased to announce that as of today, MongoDB will offer a free tier of MongoDB Atlas on GCP in three supported regions, strategically located in North America, Europe and Asia Pacific in recognition of our wide user install base.

The free tier gives developers a no-cost sandbox environment for MongoDB Atlas on GCP. You can test potential MongoDB workloads on the free tier and upgrade to a larger paid Atlas cluster once you have confidence in our cloud products and performance.

As of today, these specific regions are supported by the Atlas free tier:
  1. Iowa (us-central1)
  2. Belgium (europe-west1)
  3. Singapore (asia-southeast1)
To get started, you’ll just need to log in to your MongoDB console, select “Build a New Cluster,” pick “Google Cloud Platform,” and look for the “Free Tier Available” message. The free tier utilizes MongoDB’s M0 instances. An M0 cluster is a sandbox MongoDB environment for prototyping and early development with 512MB of storage space. It also comes with strong enterprise features such as always-on authentication, end-to-end encryption and high availability, as well as monitoring. Happy experimenting!
