
Announcing resource-based pricing for Google Compute Engine

The promise and benefit of the cloud have always been flexibility, low cost, and pay-per-use. With Google Compute Engine, custom machine types let you create VM instances of any size and shape, and we automatically apply committed use and sustained use discounts to reduce your costs. Today, we are taking the concept of pay-for-use in Compute Engine even further with resource-based pricing.

With resource-based pricing we are making a number of changes behind the scenes that align how we treat metering of custom and predefined machine types, as well as how we apply sustained use discounts. Simply put, we’ve made changes to automatically provide you with more savings and an easy-to-understand monthly bill. Who doesn’t love that?

Resource-based pricing considers usage at a granular level. Instead of evaluating your usage based on which machine types you use, it evaluates how many resources you consume over a given time period. What does that mean? It means that a core is a core, and a GB of RAM is a GB of RAM. It doesn’t matter what combination of predefined machine types you are running. Now we look at them at the resource level—in the aggregate. It gets better, too, because sustained use discounts are now calculated regionally, instead of just within zones. That means you can accrue sustained use discounts even faster, so you can save even more automatically.

To better understand these changes, and to get an idea of how you can save, let’s take a look at how sustained use discounts worked previously, and how they’ll work moving forward.
  • Previously, if you used a specific machine type (e.g. n1-standard-4) with four vCPUs for 50% of the month, you got an effective discount of 10%. If you used it for 75% of the month, you got an effective discount of 20%. If you used it for 100% of the month, you got an effective discount of 30%.
Okay. Now, what if you used different machine types?
  • Let’s say you were running a web-based service. You started the month running an n1-standard-4 with four vCPUs. In the second week, user demand for your service increased and you scaled capacity, moving to an n1-standard-8 with eight vCPUs. Ever-increasing demand caused you to scale up again: in week three you began running an n1-standard-16 with sixteen vCPUs, and thanks to your continued success you ended the month running an n1-standard-32 with thirty-two vCPUs. In this scenario you wouldn’t receive any discount, because you didn’t run any single machine type for at least 50% of the month.

With resource-based pricing, we no longer consider your machine type and instead, we add up all the resources you use across all your machines into a single total and then apply the discount. You do not need to take any action. You save automatically. Let’s look at the scaling example again, but this time with resource-based pricing.
  • You began the month running four vCPUs, and subsequently scaled to eight vCPUs, sixteen vCPUs and finally thirty-two vCPUs. You ran four vCPUs all month, or 100% of the time, so you receive a 30% discount on those vCPUs. You ran another four vCPUs for 75% of the month, so you receive a 20% discount on those vCPUs. And finally, you ran another eight vCPUs for half the month, so you receive a 10% discount on those vCPUs. The remaining sixteen vCPUs ran for only one week, so they did not qualify for a discount. Let’s visualize how this works, to reinforce what we’ve learned.
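To make the discount tiers concrete, here is a small sketch that reproduces the effective discounts described above. The quarter-by-quarter incremental rates are an assumption chosen so the math works out to the 10%/20%/30% figures in the post; they are not published billing logic.

```python
# Assumed incremental rates (not from the post): each successive quarter of
# the month a resource runs is billed at a lower fraction of the base rate.
QUARTER_RATES = [1.00, 0.80, 0.60, 0.40]

def effective_discount(fraction_of_month):
    """Effective discount for a vCPU that ran this fraction of the month."""
    billed = 0.0
    remaining = fraction_of_month
    for rate in QUARTER_RATES:
        used = min(remaining, 0.25)   # bill at most a quarter-month per tier
        billed += used * rate
        remaining -= used
        if remaining <= 0:
            break
    return 1 - billed / fraction_of_month

for f in (0.50, 0.75, 1.00):
    # prints: 50% -> 10%, 75% -> 20%, 100% -> 30%
    print(f"{f:.0%} of month -> {effective_discount(f):.0%} discount")
```

Applied to the scaling example: the four vCPUs that ran all month get 30%, the four that ran 75% get 20%, the eight that ran half the month get 10%, and the final sixteen ran under 25% of the month, so no discount.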

And because resource-based pricing applies at a regional level, it’s now even easier for you to benefit from sustained use discounts, no matter which machine types you use, or the number of zones in a region in which you operate. Resource-based pricing will take effect in the coming months. Visit the Resource-based pricing page to learn more.

Cloud Services Platform: bringing the best of the cloud to you

In the decade since cloud computing became mainstream, it’s captured the hearts and minds of developers and enterprises everywhere. But for most IT organizations, cloud is still but a glimmer of what it could be—or what it should be. Today, we’re excited to share our vision for Cloud Services Platform, an integrated family of cloud services that lets you increase speed and reliability, improve security and governance and build once to run anywhere, across GCP and on-premise environments. With Cloud Services Platform, we bring the benefits of the cloud to you, no matter where you deploy your IT infrastructure today—or tomorrow.

Cloud Services Platform puts all your IT resources into a consistent development, management and control framework, automating away low-value and insecure tasks across your on-premise and Google Cloud infrastructure. Specifically, we’re announcing:
  • Service mesh: Availability of Istio 1.0 in open source, Managed Istio, and Apigee API Management for Istio
  • Hybrid computing: GKE On-Prem with multi-cluster management
  • Policy enforcement: GKE Policy Management, to take control of Kubernetes workloads
  • Ops tooling: Stackdriver Service Monitoring
  • Serverless computing: GKE Serverless add-on and Knative, an open source serverless framework
  • Developer tools: Cloud Build, a fully managed CI/CD platform
The Cloud Services Platform family

“We needed a consistent platform to deploy and manage containers on-premise and in the cloud. As Kubernetes has become the industry standard, it was natural for us to adopt Kubernetes Engine on GCP to reduce the risk and cost of our deployments.”
- Dinesh Keswani, Global Chief Technology Officer at HSBC
Cloud Services Platform is technologically and architecturally aligned with the joint hybrid cloud products we've been developing and bringing to market with our close partner, Cisco. Our joint solution, Cisco Hybrid Cloud Platform for Google Cloud, will be generally available next month and is certified to be consistent with Kubernetes Engine, enabling GCP out of the box.

Today, let’s take a look at aspects of the Cloud Services Platform, and how it lays a foundation for a fully realized cloud infrastructure.

Modernizing application architecture with Istio

Last year, we took a step toward helping organizations move from reactive IT management to proactive service operations—the idea of managing at a higher layer of the stack, enabling greater application awareness and control. In collaboration with several industry partners, we announced Istio, an open-source service mesh that gives operators the controls they need to manage microservices at scale. We are excited to say that open-source Istio will move to version 1.0 shortly, making it ready for production deployments.

Building on that open-source foundation, we are announcing a managed Istio service that you can use to manage services within a Kubernetes Engine cluster. Managed Istio, in alpha, is an Istio-powered service mesh available in Kubernetes Engine, complete with enterprise support. Managed Istio accelerates your journey to service operations with three high-level capabilities:
  • Service discovery and intelligent traffic management—Managed Istio surfaces all the services running in your cluster and manages network traffic between them, using application-level load balancing and sophisticated traffic routing for container and VM workloads. It also provides health checks plus canary and blue/green deployments, enabling fault-tolerant applications with circuit breaking and timeouts.
  • Secure, authenticated communications—Managed Istio offers endpoint segmentation and granular policy enforcement, compliance support and anomaly detection, and traffic encryption by default using mTLS.
  • Monitoring and management—Understand and troubleshoot the system of services running across Managed Istio, including integration with Stackdriver, our suite of monitoring and management tools.
It's still early days, but we are very excited about Istio and Managed Istio, foundational technologies that will drive the use of containers and microservices, while helping to make your environment much more manageable, scalable and available.

Enterprise-grade Kubernetes, wherever you go

A great path to well-managed applications is undoubtedly containers and microservices, and having a common Kubernetes management layer can help get you there that much faster. Four years ago, we released Kubernetes, and the resulting Kubernetes Engine managed service is battle-tested and growing by leaps and bounds: In 2017 Kubernetes Engine core-hours grew 9X year over year.

Today, we are excited to bring that same managed Kubernetes Engine experience to your on-premise infrastructure. GKE On-Prem, soon to be in alpha, is Google-configured Kubernetes that you can deploy in the environment of your choice. GKE On-Prem makes it easy to install and upgrade Kubernetes and provides access to the following capabilities across GCP and on-premise:
  • Unified multi-cluster registration and upgrade management
  • Centralized monitoring and logging with Stackdriver integration
  • Hybrid Identity and Access Management
  • GCP Marketplace for Kubernetes applications
  • Unified cluster management for GCP and on-premise
  • Professional services and enterprise-grade support
Now, with GKE On-Prem, you can begin to modernize existing applications on-premise, without necessarily moving to the cloud. You gain control of your journey to the cloud at your own pace.

Automatically take control of your Kubernetes workloads

When it comes to managing clusters at scale, it’s imperative to have the right security controls in place and ensure your policies can be easily managed and enforced. Today, we’re pleased to announce GKE Policy Management, which delivers centralized capabilities that make it far easier for administrators to configure Kubernetes (wherever it may be running).

With GKE Policy Management, Kubernetes administrators create a single source of truth for their policies that automatically syncs with any enrolled cluster. GKE Policy Management supports policies stored as definitions in a repository, and can also use your existing Google Cloud IAM policies to make it simple to secure your clusters. GKE Policy Management is coming soon to alpha; sign up here to express interest.

A service-centric view of your environment

More than simply making it easier to migrate workloads to the cloud, the technologies found in Cloud Services Platform lay the groundwork for improving service operations, by providing administrators with a service-centric view of their infrastructure, rather than infrastructure views of services. Today, we are announcing Stackdriver Service Monitoring, which provides the following new views:
  • Service graph: A real-time bird’s-eye visualization of the entire environment—see all your microservices, how they communicate, and their dependencies.
  • Service level objective (SLO) monitoring: Monitor and alert in the same customer-centric, low-toil manner as Google Site Reliability Engineers (SRE) do for our own services.
  • Service dashboard: All your signals for a given service are in a single place so that you can debug faster and easier than ever before and lower your mean-time-to-resolution (MTTR).
Stackdriver Service Monitoring is designed for workloads running on opinionated Istio infrastructure, as well as App Engine.

When microservices become APIs

Microservices provide a simple, compelling way for organizations to accelerate moving workloads to the cloud, serving as a path towards a larger cloud strategy. Istio enables service discovery, connection and management for microservices. But as soon as those services are needed for internal groups, partners or developers outside of the enterprise, they quickly cross the line and become APIs.

Just as organizations need services management for microservices, they need API management for their APIs. Apigee API Management complements Istio with the robust features of Google Cloud's Apigee API management platform, Apigee Edge, by extending API management natively into the microservices stack. Apigee Edge features include API usage, access, productization, catalog and discovery, plus a developer portal to create a smooth experience for developers and increase API consumption.

Making cloud all it could be

Here at Google, we could never have done what we do today without containers and Kubernetes, but taking a service-oriented view of our operations has been equally critical. In addition to the core capabilities mentioned above, Cloud Services Platform provides access to other new areas of functionality:
  • GKE serverless add-on lets you run serverless workloads on Kubernetes Engine with a one-step deploy. You can go from source to containers amazingly fast, auto-scale your stateless container-based workloads, and even scale down to zero. Sign up for an early preview for the GKE serverless add-on here.
  • Knative (pronounced kay-nay-tiv), open-source serverless components from the same technology that enables the GKE serverless add-on. Knative lets you create modern, container-based and cloud-native applications by providing building blocks you need to build and deploy container-based serverless applications anywhere on Kubernetes.
  • Cloud Build is a fully-managed Continuous Integration/Continuous Delivery (CI/CD) platform that lets you build, test, and deploy software quickly, at scale.
Now, with Cloud Services Platform, we’re excited to bring the full potential of the cloud to you, wherever your workloads may be. For more on Cloud Services Platform, you can read about how it relates to serverless computing.

Bringing the best of serverless to you

Every business wants to innovate—and deliver—great software, faster. In recent years, serverless computing has changed application development, bringing the focus on the application logic instead of infrastructure. With zero server management, auto-scaling to meet any traffic demands, and managed integrated security, developers can move faster, stay agile and focus on what matters most—building great applications.

Google helped pioneer the notion of serverless more than 10 years ago with the introduction of App Engine. Making developers more productive is just as important today as it was then. Over the past few years, we have been working hard to bring the benefits of serverless that we learned from App Engine to our compute, storage, database, messaging services, data analytics, and machine learning offerings.

Today, in tandem with the launch of our Cloud Services Platform, we are sharing several important developments to our serverless compute stack:
  • New App Engine runtimes
  • Cloud Functions general availability, support for additional languages, plus performance, networking and security features
  • Serverless containers on Cloud Functions
  • GKE serverless add-on
  • Knative, Kubernetes-based building blocks for serverless workloads
  • Integration of Cloud Firestore with GCP services

Expanding serverless compute

Today we are announcing support for new second-generation App Engine standard runtimes such as Python 3.7 and PHP 7.2, in addition to recent support for Node.js 8. Second-generation runtimes provide developers with idiomatic, open-source language runtimes capable of running any framework, library, or binary. Based on gVisor technology, these new runtimes enable faster deployments and increased application performance.

Also, Cloud Functions, our event-driven compute service, is generally available starting today, complete with predictable service guaranteed by an SLA, and a global footprint with new regions in Europe and Asia. In addition, we are bolstering Cloud Functions with a range of new and heavily requested features including support for Python 3.7 and Node.js 8, networking and security controls, and performance improvements across the board. Cloud Functions also lets you seamlessly connect and extend more than 20 GCP services such as BigQuery, Cloud Pub/Sub, machine learning APIs, G Suite, Google Assistant and many more.
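As a sketch of what the newly supported Python 3.7 runtime looks like in practice, here is a minimal HTTP-triggered function. The function name, response text, and deploy command are illustrative, not from the announcement:

```python
# Minimal sketch of an HTTP-triggered Cloud Function (Python 3.7 runtime).
# Illustrative deploy command:
#   gcloud functions deploy hello_gcp --runtime python37 --trigger-http
def hello_gcp(request):
    """HTTP Cloud Function; `request` is a Flask request object."""
    # Fall back to a default greeting when no ?name= parameter is supplied.
    name = (request.args.get("name") if request.args else None) or "World"
    return f"Hello, {name}!"
```

Cloud Functions invokes the function once per HTTP request and returns the function's return value as the response body.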

Serverless and containers: the best of both worlds

Whether you’re using App Engine or Cloud Functions, Google’s serverless platform offers a complete mix of tools and services. However, many customers tell us they have custom requirements like specific runtimes, custom binaries, or workload portability. More often than not, they turn to containers for an answer. At Google Cloud, we want to bring the best of both serverless and containers together.

Today, we’re also introducing serverless containers, which allow you to run container-based workloads in a fully managed environment and still only pay for what you use. Sign up for an early preview of serverless containers on Cloud Functions to run your own containerized functions on GCP with all the benefits of serverless.

And what if you are already using Kubernetes Engine? A new GKE serverless add-on lets you run serverless workloads on Kubernetes Engine with a one-step deploy. You can go from source to containers instantaneously, auto-scale your stateless container-based workloads, and even scale down to zero. Here’s what T-Mobile had to say about running their serverless workloads on Kubernetes Engine:
"The technology behind the GKE serverless add-on enabled us to focus on just the business logic, as opposed to worrying about overhead tasks such as build/deploy, autoscaling, monitoring and observability."
- Ram Gopinathan, Principal Technology Architect, T-Mobile

With Knative, run your serverless workloads anywhere

While we believe Google Cloud is a great place to run all types of workloads, some customers need to run on-premises or across multiple clouds. Based on this feedback, we’re excited to announce Knative (pronounced kay-nay-tiv), which is an open-source set of components from the same technology that enables the GKE serverless add-on.

Developed in close partnership with Pivotal, IBM, Red Hat, and SAP, Knative pushes Kubernetes-based computing forward by providing the building blocks you need to build and deploy modern, container-based serverless applications.

Knative focuses on the common but challenging parts of running apps, such as orchestrating source-to-container builds, routing and managing traffic during deployment, auto-scaling workloads, and binding services to event ecosystems. Knative provides you with familiar, idiomatic language support and standardized patterns you need to deploy any workload, whether it’s a traditional application, function, or container.

Knative provides reusable implementations of common patterns and codified best practices, shared by successful, real-world Kubernetes-based frameworks and applications. For instance, Knative comes with a build component that provides powerful abstractions and a flexible workflow for building, testing, or deploying container images or non-container artifacts on a Kubernetes cluster. By integrating Knative into your own platform, you don’t have to choose between the portability and familiarity of containers and the automation and efficiency of serverless computing. And you can enjoy the benefits of Google Cloud’s extensive experience delivering serverless computing whether you run on GCP, on-premises or in any other cloud. Get started today with Knative or join the conversation.

A comprehensive serverless ecosystem

Of course, serverless computing is a non-starter if you can’t easily build and deploy the code, store your data, and manage your applications in production as part of your overall IT environment. At Google Cloud, we’re committed to enabling the comprehensive ecosystem of serverless offerings.

Cloud Build, for instance, lets you create a continuous integration and delivery (CI/CD) pipeline for your serverless applications. You can define custom workflows for building, testing, and deploying across multiple serverless environments such as Cloud Functions, App Engine and even Knative.

Cloud Firestore, one of the most recent additions to our serverless stack, lets you store and sync your app data at global scale. Soon, app developers will be able to easily access Cloud Firestore within the GCP Console, and it will also be compatible with Cloud Datastore.

Finally, our Stackdriver suite has four core capabilities—monitoring, logging, application performance management (APM) and the newly released Service Monitoring—and lets you operate and rapidly diagnose your serverless applications in production.

Toward ubiquitous serverless computing

We’re firm believers in finding ways to simplify operations and bring solutions to market faster. Last week’s launch of commercial Kubernetes applications in GCP Marketplace demonstrates how third-party solutions providers are adopting new technologies rapidly to support enterprise demand for extensible solutions. Now, with these new offerings, we’ll help more developers adopt serverless computing in the languages and platforms of their choice.

Click here to learn about the full breadth of Google Cloud serverless technologies.

Partnering with Intel and SAP on Intel Optane DC Persistent Memory for SAP HANA

Our customers do extraordinary things with their data. But as their data grows, they face challenges like the cost of resources needed to handle and store it, and the general sizing limitations with low latency in-memory computing workloads.

Our customers' use of in-memory workloads with SAP HANA for innovative data management use cases is driving the demand for even larger memory capacity. We’re constantly pushing the boundaries on GCP’s instance sizes and exploring increasingly cost-effective ways to run SAP workloads on GCP.

Today, we’re announcing a partnership with Intel and SAP to offer GCP virtual machines supporting the upcoming Intel® Optane™ DC Persistent Memory for SAP HANA workloads. These GCP VMs will be powered by the future Intel® Xeon® Scalable processors (code-named Cascade Lake), thereby expanding VM resource sizing and providing cost benefits for customers.

Compute Engine VMs with Intel Optane DC persistent memory will offer higher overall memory capacity with lower cost compared to instances with only dynamic random-access memory (DRAM). This will help enable you to scale up your instances while keeping your costs under control. Compute Engine has consistently been focused on decreasing your operational overhead through capabilities such as Live Migration. And coupled with the native persistence benefits of Intel Optane DC Persistent Memory, you’ll get faster restart times for your most critical business applications.

Google Cloud instances on Intel Optane DC Persistent Memory for SAP HANA and other workloads will be available in alpha later this year for customer testing. To learn more, please fill out this form to register your interest.

To learn more about this partnership, visit our Intel and SAP partnership pages.

5 must-see network sessions at Google Cloud NEXT 2018

Whether you’re moving data to or from Google Cloud, or are knee-deep plumbing your cloud network architecture, there’s a lot to learn at Google Cloud Next 2018 next week in San Francisco. Here’s our shortlist of the five must-see networking breakout sessions at the show, in chronological order from Wednesday to Thursday.
Operations engineer Rebekah Roediger delivering cloud network capacity, one link at a time, in our Netherlands cloud region (europe-west4).

GCP Network and Security Telemetry
Speakers: Ines Envid, Senior Product Manager, Yuri Solodkin, Staff Software Engineer and Vineet Bhan, Head of Security Partnerships
Network and security telemetry is fundamental to operate your deployments in public clouds with confidence, providing the required visibility on the behavior of your network and access control firewalls.
When: July 24th, 2018 12:35pm

A Year in GCP Networking
Speakers: Srinath Padmanabhan, Networking Product Marketing Manager, Google Cloud and Nick Jacques, Lead Cloud Engineer, Target
In this session, we will talk about the valuable advancements that have been made in GCP Networking over the last year. We will introduce you to the GCP Network team and will tell you about what you can do to extract the most value from your GCP Deployment.
When: July 24th, 2018 1:55pm

Cloud Load Balancing Deep Dive and Best Practices
Speakers: Prajakta Joshi, Sr. Product Manager and Mike Columbus, Networking Specialist Team Manager
Google Cloud Load Balancing lets enterprises and cloud-native companies deliver highly available, scalable, low-latency cloud services with a global footprint. You will see demos and learn how enterprise customers deploy Cloud Load Balancing and the best practices they use to deliver smart, secure, modern services across the globe.
When: July 25th, 2018 12:35pm

Hybrid Connectivity - Reliably Extending Your Enterprise Network to GCP
Speaker: John Veizades, Product Manager, Google Cloud
In this session, you will learn how to connect to GCP with highly reliable and secure networking to support extending your data center networks into the cloud. We will cover details of resilient routing techniques, access to Google API from on premise networks, connection locations, and partners that support connectivity to GCP -- all designed to support mission-critical network connectivity to GCP.
When: July 26th, 2018 11:40am

VPC Deep Dive and Best Practices
Speakers: Emanuele Mazza, Networking Product Specialist, Google, Neha Pattan, Software Engineer, Google and Kamal Congevaram Muralidharan, Senior Member Technical Staff, Paypal
This session will walk you through the unique operational advantages of GCP VPC for your enterprise cloud deployments. We’ll go through detailed use cases, how to seal and audit your VPC, how to extend your VPC to on-prem in hybrid scenarios, and how to deploy highly available services.
When: July 26th, 2018 9:00am

Be sure to reserve your spot in these sessions today—space is filling up!

Kubernetes wins OSCON Most Impact Award

Today at the Open Source Awards at OSCON 2018, Kubernetes won the inaugural Most Impact Award, which recognizes a project that has had a ‘significant impact on how software is being written and applications are built’ in the past year. Thank you O’Reilly OSCON for the recognition, and more importantly, thank you to the vast Kubernetes community that has driven the project to where it is today.

When we released Kubernetes just four years ago, we never quite imagined how successful the project would be. We designed Kubernetes from a decade of experience running production workloads at Google, but we didn’t know whether the outside world would adopt it. However we believed that if we remained open to new ideas and new voices, the community would provide feedback and contributions to move the project forward to meet the needs of users everywhere.

This openness led to Kubernetes’ rapid adoption—and it’s also one of the core pillars of Google Cloud: our belief in an open cloud, so that you can pick-up and move your app wherever you want. Whether it’s Tensorflow, an open source library for machine learning, Asylo, a framework for confidential computing, or Istio, an open platform to connect microservices, openness remains a core value here at Google Cloud.

To everyone who has helped make Kubernetes the success it is today, many thanks again.

If you haven’t tried Kubernetes, it’s easy to get started with using Google Kubernetes Engine. If you’re interested to learn more about Kubernetes and the ecosystem it spawned, then subscribe to the Kubernetes Podcast from Google to hear weekly insights from leaders in the community.

VMware and Google Cloud: building the hybrid cloud together with vRealize Orchestrator

Many of our customers with hybrid cloud environments rely on VMware software on-premises. They want to simplify provisioning and enable end-user self service. At the same time, they also want to make sure they’re complying with IT policies and following IT best practices. As a result, many use VMware vRealize Automation, a platform for automated self-service provisioning and lifecycle management of IT infrastructure, and are looking for ways to leverage it in the cloud.

Today, we’re announcing the preview of our plug-in for VMware vRealize Orchestrator and support for Google Cloud Platform (GCP) resources in vRealize Automation. With these resources, you can now deploy and manage GCP resources from within your vRealize Automation environment.

The GCP plug-in for VMware vRealize Orchestrator provides a consistent management and governance experience across on-premises and GCP-based IT environments. For example, you can use Google-provided blueprints or build your own blueprints for Google Compute Engine resources and publish to the vRealize service catalog. This means you can select and launch resources in a predictable manner that is similar to how you launch VMs in your on-premises VMware environment, using a tool you’re already familiar with.

This preview release allows you to:
  • Create vRealize Automation “blueprints” for Compute Engine VM Instances
  • Request and self-provision resources in GCP using vRA’s catalog feature
  • Gain visibility and reclaim resources in GCP to reduce operational costs
  • Enforce access and resource quota policies for resources in GCP
  • Initiate Day 2 operations (start, stop, delete, etc.) on Compute Engine VM Instances, Instance Groups and Disks
The GCP plug-in for vRealize makes it easy for you to unlock new hybrid scenarios. For example:

  1. Reach new regions to address global business needs. (Hello Finland, Mumbai and Singapore.)
  2. Define large-scale applications using vRA and deploy to Compute Engine to leverage GCP’s worldwide load balancing and automatic scaling.
  3. Save money by deploying VMs as Compute Engine Preemptible VM Instances and using Custom Machine Types to tailor the VM configuration to application needs.
  4. Accelerate the time it takes to train a machine learning model by using Compute Engine with NVIDIA® Tesla® P100 GPUs.
  5. Replicate your on-premises applications to the cloud and scale up or down as your business dictates.
While this preview offers support for Compute Engine Virtual Machines in vRealize Automation, we’re working together with VMware to add support for additional GCP products such as Cloud TPUs—we’ll share more on that in the coming months. You can also find more information about this announcement by reading VMware’s blog.

In the meantime, to join the preview program, please submit a request using the preview intake form.

SRE fundamentals: SLIs, SLAs and SLOs

Next week at Google Cloud Next ‘18, you’ll be hearing about new ways to think about and ensure the availability of your applications. A big part of that is establishing and monitoring service-level metrics—something that our Site Reliability Engineering (SRE) team does day in and day out here at Google. The end goal of our SRE principles is to improve services and, in turn, the user experience, and next week we’ll be discussing some new ways you can incorporate SRE principles into your operations.

In fact, a recent Forrester report on infrastructure transformation offers details on how you can apply these SRE principles at your company—more easily than you might think. They found that enterprises can apply most SRE principles either directly or with minor modification.

To learn more about applying SRE in your business, we invite you to join Ben Treynor, head of Google SRE, who will be sharing some exciting announcements and walking through real-life SRE scenarios at his Next ‘18 Spotlight session. Register now as seats are limited.

The concept of SRE starts with the idea that metrics should be closely tied to business objectives. We use several essential measurements—SLO, SLA and SLI—in SRE planning and practice.

Defining the terms of site reliability engineering

These measurements aren’t just useful abstractions. Without them, you cannot know if your system is reliable, available or even useful. If they don’t tie explicitly back to your business objectives, then you don’t have data on whether the choices you make are helping or hurting your business.

As a refresher, here’s a look at the key measurements of SRE, as discussed by AJ Ross, Adrian Hilton and Dave Rensin of our Customer Reliability Engineering team, in the January 2017 blog post, SLOs, SLIs, SLAs, oh my - CRE life lessons.

1. Service-Level Objective (SLO)

SRE begins with the idea that a prerequisite to success is availability. A system that is unavailable cannot perform its function and will fail by default. Availability, in SRE terms, defines whether a system is able to fulfill its intended function at a point in time. In addition to being used as a reporting tool, the historical availability measurement can also describe the probability that your system will perform as expected in the future.

When we set out to define the terms of SRE, we wanted to set a precise numerical target for system availability. We term this target the Service-Level Objective (SLO) of our system. Any discussion we have in the future about whether the system is running sufficiently reliably and what design or architectural changes we should make to it must be framed in terms of our system continuing to meet this SLO.

Keep in mind that the more reliable the service, the more it costs to operate. Define the lowest level of reliability that you can get away with for each service, and state that as your SLO. Every service should have an SLO—without it, your team and your stakeholders cannot make principled judgments about whether your service needs to be made more reliable (increasing cost and slowing development) or less reliable (allowing greater velocity of development). Excessive availability can become a problem because now it’s the expectation. Don’t make your system overly reliable if you don’t intend to commit to it always being that reliable.

Within Google, we implement periodic downtime in some services to prevent a service from being overly available. You might also try experimenting with planned-downtime exercises with front-end servers occasionally, as we did with one of our internal systems. We found that these exercises can uncover services that are using those servers inappropriately. With that information, you can then move workloads to somewhere more suitable and keep servers at the right availability level.

2. Service-Level Agreement (SLA)

At Google, we distinguish between an SLO and a Service-Level Agreement (SLA). An SLA normally involves a promise to someone using your service that its availability should meet a certain level over a certain period, and if it fails to do so then some kind of penalty will be paid. This might be a partial refund of the service subscription fee paid by customers for that period, or additional subscription time added for free. The concept is that going out of SLA is going to hurt the service team, so they will push hard to stay within SLA. If you’re charging your customers money, you will probably need an SLA.

Because of this, and because of the principle that availability shouldn’t be much better than the SLO, the SLA is normally a looser objective than the SLO. This might be expressed in availability numbers: for instance, an availability SLA of 99.9% over one month, with an internal availability SLO of 99.95%. Alternatively, the SLA might only specify a subset of the metrics that make up the SLO.
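To make the gap between SLA and SLO concrete, here's a quick sketch (using the 99.9%/99.95% figures from the example above) that converts each availability target into an allowed-downtime budget for a 30-day month:

```python
# Convert availability targets into downtime budgets for a 30-day month.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

def downtime_budget(availability: float) -> float:
    """Minutes of downtime allowed per month at the given availability."""
    return MINUTES_PER_MONTH * (1 - availability)

sla_budget = downtime_budget(0.999)    # external SLA: 43.2 minutes
slo_budget = downtime_budget(0.9995)   # internal SLO: 21.6 minutes

# The internal SLO leaves roughly half the downtime the SLA permits,
# so breaching the SLO is an early warning well before the SLA is at risk.
print(f"SLA budget: {sla_budget:.1f} min, SLO budget: {slo_budget:.1f} min")
```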

If you have an SLA that is different from your SLO, as you almost always will, it’s important for your monitoring to measure SLA compliance explicitly. You want to be able to view your system’s availability over the SLA calendar period, and easily see if it appears to be in danger of going out of SLA. You will also need a precise measurement of compliance, usually from logs analysis. Since we have an extra set of obligations (in the form of our SLA) to paying customers, we need to measure queries received from them separately from other queries. That’s another benefit of establishing an SLA—it’s an unambiguous way to prioritize traffic.

When you define your SLA, you need to be extra-careful about which queries you count as legitimate. For example, if a customer goes over quota because they released a buggy version of their mobile client, you may consider excluding all “out of quota” response codes from your SLA accounting.
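A sketch of that kind of filtering follows; the log-record shape and the use of HTTP 429 to stand in for "out of quota" are assumptions for illustration, not a prescribed format:

```python
# Exclude "out of quota" responses (modeled here as HTTP 429) from SLA
# accounting, so a customer's own quota overruns don't count against us.
OUT_OF_QUOTA = {429}

def sla_availability(log_records: list[dict]) -> float:
    """Availability over SLA-eligible queries only."""
    eligible = [r for r in log_records if r["status"] not in OUT_OF_QUOTA]
    if not eligible:
        return 1.0  # no eligible traffic: SLA trivially met
    ok = sum(1 for r in eligible if r["status"] < 500)
    return ok / len(eligible)

logs = [
    {"status": 200}, {"status": 200},
    {"status": 429},  # quota overrun: excluded from accounting
    {"status": 503},  # server error: counts against the SLA
]
print(f"SLA availability: {sla_availability(logs):.0%}")
```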

3. Service-Level Indicator (SLI)

We also have a direct measurement of SLO conformance: the frequency of successful probes of our system. This is a Service-Level Indicator (SLI). When we evaluate whether our system has been running within SLO for the past week, we look at the SLI to get the service availability percentage. If it goes below the specified SLO, we have a problem and may need to make the system more available in some way, such as running a second instance of the service in a different city and load-balancing between the two. If you want to know how reliable your service is, you must be able to measure the rates of successful and unsuccessful queries; these will form the basis of your SLIs.
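As an illustration (the query counts are hypothetical, not drawn from a real monitoring system), computing an availability SLI from success and failure counts and checking it against an SLO might look like:

```python
def availability_sli(successful: int, failed: int) -> float:
    """Availability SLI: the fraction of probes (or queries) that succeeded."""
    total = successful + failed
    if total == 0:
        return 1.0  # no traffic observed: treat as meeting the objective
    return successful / total

SLO = 0.9995  # target availability, per the earlier example

# Hypothetical counts for the past week.
sli = availability_sli(successful=1_998_800, failed=1_200)
if sli < SLO:
    print(f"SLI {sli:.4%} is below SLO {SLO:.2%}: investigate")
else:
    print(f"SLI {sli:.4%} meets SLO {SLO:.2%}")
```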

Since the original post was published, we’ve made some updates to Stackdriver that let you incorporate SLIs even more easily into your Google Cloud Platform (GCP) workflows. You can now combine your in-house SLIs with the SLIs of the GCP services that you use, all in the same Stackdriver monitoring dashboard. At Next ‘18, the Spotlight session with Ben Treynor and Snapchat will illustrate how Snap uses its dashboard to get insight into what matters to its customers and map it directly to what information it gets from GCP, for an in-depth view of customer experience.
Automatic dashboards in Stackdriver for GCP services let you group the 50th, 95th and 99th percentile charts several ways: per service, per method and per response code. You can also view latency charts on a log scale to quickly find outliers.

If you’re building a system from scratch, make sure that SLIs and SLOs are part of your system requirements. If you already have a production system but don’t have them clearly defined, then that’s your highest priority work. If you’re coming to Next ‘18, we look forward to seeing you there.


Bringing GPU-accelerated analytics to GCP Marketplace with MapD

Editor’s note: Today, we hear from our partner MapD, whose data analytics platform uses GPUs to accelerate queries and visualizations. Read on to learn how MapD and Google Cloud are working together.

MapD and public cloud are a great fit. Combining cloud-based GPU infrastructure with MapD’s performance, interactivity and operational ease of use is a big win for our customers, allowing data scientists and analysts to visually explore billion-row datasets with fluidity and minimal hassle.

Our Community and Enterprise Edition images are available on AWS, and MapD Docker containers are available on NVIDIA GPU Cloud (NGC) as well as on our own MapD Cloud. Today, we’re thrilled to announce the availability of MapD on Google Cloud Platform (GCP) Marketplace, helping us bring interactivity at scale to the widest possible audience. With services like Cloud Dataflow, Cloud Bigtable and Cloud AI, GCP has emerged as a great platform for data-intensive workloads. Combining MapD and these services lets us define scalable, high-performance visual analytics workflows for a variety of use cases.

On GCP, you’ll find both our Community and Enterprise editions for K80, Pascal and Volta GPU instances in the GCP Marketplace. Google’s flexible approach to attaching GPU dies to standard CPU-based instance types means you can dial up or down the necessary GPU capacity for your instances depending on the size of your datasets and your compute needs.

We’re confident that MapD’s availability on GCP Marketplace will further accelerate the adoption of GPUs as a key part of enterprise analytics workloads, in addition to their obvious applicability to AI, graphics and general-purpose computing. Click here to try out MapD on GCP.

Improving our account management policies to better support customers

Recently, a Google Cloud Platform (GCP) customer blogged about an incident in June, in which a project they were running on Google Cloud Platform was suspended. We really appreciated the candid feedback, in which our customer noted several aspects of our account management process which needed to be improved. We also appreciated our customer’s followup and recognition of the Google Cloud support team, “who have reached out and assured us these incidents will not repeat.”

Here’s what we are doing to be as good as our word, and provide a more careful, accurate, thoughtful and empathetic account management experience for our GCP customers. These changes are intended to provide peace of mind and a predictable, positive experience for our GCP customers, while continuing to permit appropriate suspension and removal actions for the inevitable bad actors and fraud which are a part of operating a public cloud service.

No Automatic Fraud-Based Account Suspensions for Established Customers with Offline Payment. Established GCP customers complying with our Acceptable Use Policy (AUP, TOS and local laws), with an offline billing contract, invoice billing, or an active relationship with our sales team, are not subject to fraud-based automatic account suspension.

Delayed Suspension for Established Online Customers. Online customers with established payment history, operating in compliance with our TOS, AUP and local laws, will receive advance notification and a 5-day cure period in the event that fraud or account compromise activity is detected in their projects.

Other Customers. For all other customers, we will institute a second human review for flagged fraud accounts prior to suspending an account. We’re also modifying who has authority to suspend an account, as well as refreshing our training for the teams that review flagged accounts and determine response actions; re-evaluating the signals, sources, and the tools we use to assess potential fraudulent activity; and increasing the number of options we can use to help new customers quickly and safely grow their usage while building an account history with Google.

In addition to the above, for all customers we are making the following improvements:

24X7 Chat Support. We are rolling out 24X7 chat support for customers that receive account notices, so that customers can always reach us easily. We expect this to be fully rolled out for all customers by September.

Correcting Notices About our 30 Day Policy. Our customer noted, with appropriate concern, that their suspension email stated “we will delete your account in 3 days.” This language was simply incorrect—our fraud suspension policy provides 30 days before removal. We have corrected the communication language, and we are conducting a full review of our communications and systems to ensure that our messages are accurate and clear.

Updating Our Project Suspension Guidelines. We will review and update our project suspension guidelines to clarify our practices and describe what you should expect from Google.

Improving Customer Contact Points. We will encourage customers to provide us with a verifiable phone number, email, and other contact channels, both at sign-up and at later points in time, so that we can quickly contact you if we detect suspicious activity on your account.

Creating Customer Pre-Verification. We will provide ways for customers to pre-verify their accounts with us if they desire, either at sign-up or at a later point in time.

These suspensions are our responsibility. There are also steps that customers can take to help us protect their accounts, including:
  1. Make sure to monitor emails sent to your payments and billing contacts so you don’t miss important alerts.
  2. Provide a valid phone number where we can reach you in the event of suspicious activity on your account.
  3. Add one or more billing admins to your account.
  4. Provide a secondary payment method in case there are problems charging your primary method.
  5. Contact our sales team to see if you qualify for invoice billing instead of relying on credit cards.
We’re making immediate changes to ensure our policies will improve our customers’ experience. Our work here is never done, and we will continue to update and optimize based on your feedback.

We sincerely apologize to all our customers who’ve been concerned or had to go through a service reinstatement. Please keep the feedback coming; we’ll keep working to earn your trust every day.