Tag Archives: Google Cloud Platform

How to deploy geographically distributed services on Kubernetes Engine with kubemci



Increasingly, many enterprise Google Cloud Platform (GCP) customers use multiple Google Kubernetes Engine clusters to host their applications, for better resilience, scalability, isolation and compliance. In addition, their users expect low-latency access to applications from anywhere around the world. Today we are introducing a new command-line interface (CLI) tool called kubemci to automatically configure ingress using Google Cloud Load Balancer (GCLB) for multi-cluster Kubernetes Engine environments. This allows you to use a Kubernetes Ingress definition to leverage GCLB along with multiple Kubernetes Engine clusters running in regions around the world, serving traffic from the closest cluster using a single anycast IP address and taking advantage of GCP's 100+ Points of Presence and global network. For more information on how GCLB handles cross-region traffic, see this link.

Further, kubemci will be the initial interface to an upcoming controller-based multi-cluster ingress (MCI) solution that can adapt to different use-cases and can be manipulated using the standard kubectl CLI tool or via Kubernetes API calls.

For example, in the picture below, we have created three independent Kubernetes Engine clusters and spread them across three continents (Asia, North America, and Europe). We then deployed the same service, "zone-printer", to each of these clusters and used kubemci to create a single GCLB instance to stitch the services together. In this case, the 1,000 requests per second (rps) from Tokyo are routed to the cluster in Asia, the New York requests are routed to the North American cluster, and the remaining 1 rps from London is routed to the European cluster. Because each request arrives at the cluster closest to the end user, it benefits from low round-trip latency. Additionally, if a region, cluster, or service were ever to become unavailable, GCLB automatically detects that and routes users to one of the other healthy service instances.

The feedback on kubemci has been great so far. Marfeel, a Spanish ad tech platform, has been using kubemci in production to improve their service offering:
“At Marfeel, we appreciate the value that this tool provides for us and our customers. Kubemci is simple to use and easily integrates with our current processes, helping to speed up our Multi-Cluster deployment process. In summary, kubemci offers us granularity, simplicity, and speed.”
- Borja García, SRE, Marfeel

Getting started

To get started with kubemci, please check out the how-to guide, which contains information on the prerequisites along with step-by-step instructions on how to download the tool and set up your clusters, services and ingress objects.

As a quick preview, once your applications and services are running, you can set up a multi-cluster ingress by running the following command:
$ kubemci create my-mci --ingress=ingress.yaml \
    --kubeconfig=cluster_list.yaml
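Once the command completes, you can check whether the load balancer has finished provisioning across your clusters. This is a sketch based on the tool's documented subcommands; my-mci and my-project are placeholders, and the exact flags may differ by kubemci version:

$ kubemci get-status my-mci --gcp-project=my-project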
To learn more, check out this talk on Multicluster Ingress by Google software engineers Greg Harmon and Nikhil Jindal, at KubeCon Europe in Copenhagen, demonstrating some initial work in this space.

Regional clusters in Google Kubernetes Engine are now generally available



Editor's note: This is one of many posts on enterprise features you’ll find in Kubernetes Engine 1.10. For the full coverage, follow along here.

A highly available Kubernetes cluster is a key requirement for most production applications. However, adding this protection can be complex. We’ve consistently heard from Kubernetes users that creating and managing a high-availability Kubernetes cluster is no small feat. Keeping etcd (the key-value store) replicas in sync across zones, scaling your masters, and ensuring that your control plane is fronted by a resilient load balancer are just some of the challenges users face when maintaining their own highly available cluster.

Today we’re proud to announce the general availability of one of Google Kubernetes Engine's most requested enterprise-grade features: regional clusters. Regional clusters create a multi-master, highly-available Kubernetes cluster that spreads both the control plane and the nodes across multiple zones in the same region, allowing us to increase the control plane uptime to 99.95%. In addition to the increased availability, regional clusters give you a zero-downtime upgrade experience, so that your cluster is always available for deployments.

We've seen rapid adoption of regional clusters since we announced the beta, with many users already running production workloads on them. In addition, we are pleased to announce today that regional clusters in Kubernetes Engine are available at no additional cost.

Get started with regional clusters

You can quickly create your first regional cluster using the Cloud Console or the gcloud command line tool.
$ gcloud container clusters create my-regional-cluster \
    --region=us-east1 --num-nodes=2
This creates a regional cluster in us-east1 with two nodes in each of the us-east1 zones.
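If you want to confirm the spread, you can list the nodes together with their zone labels. This is a minimal sketch; it assumes the standard failure-domain.beta.kubernetes.io/zone label that GKE applies to nodes:

$ kubectl get nodes -L failure-domain.beta.kubernetes.io/zone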

By creating a regional cluster, you get:
  • Resilience from single zone failure - Because your masters and application nodes are available across a region rather than a single zone, your Kubernetes cluster is still fully functional if an entire zone goes down.
  • No downtime during master upgrades - Kubernetes Engine minimizes downtime during all Kubernetes master upgrades, but with a single master, some downtime is inevitable. By using regional clusters, the control plane remains online and available, even during upgrades.
Regional clusters are just one of the many features that make Kubernetes Engine a great choice for enterprises seeking to run a production-grade, managed Kubernetes cluster in the cloud. For a more detailed explanation of the regional clusters feature, along with additional flags you can use, check out the documentation.

Last month today: GCP in May



When it comes to Google Cloud Platform (GCP), every month is chock full of news and information. We’re kicking off a monthly recap of key moments you may have missed.

What caught your attention this month:

Announcements about open source projects were some of the most-read this month.
  • Open-sourcing gVisor, a sandboxed container runtime was by far your favorite post in May. gVisor is a sandbox that lets you run containers in strongly isolated environments. It’s isolated like a virtual machine, but more lightweight and also more flexible, since it interfaces with the host OS just like another process.
  • Our introduction of Asylo, an open-source framework for confidential computing, also got your attention. As more and more sensitive workloads move to cloud, lots of businesses want to be able to verify that they’re properly isolated, inside a closed environment that’s only available to authorized users. Asylo democratizes trusted execution environments (TEEs) by allowing them to run on generic hardware. With Asylo, developers will be able to run their workloads encrypted in a highly secure environment, whether it’s on-premises or in the cloud.
  • Rounding out the open-source fun for the month was our introduction of the beta availability of Cloud Memorystore, a fully managed in-memory data store service for Redis. Cloud Memorystore gives you the caching power of Redis to reduce latency, without having to manage the details.


Hot topics: Kubernetes, DevOps and SRE

Google Kubernetes Engine 1.10 debuted in May, and we had a lot to say about the new features that this version enables—from security to brand-new monitoring functionality via Stackdriver Kubernetes Monitoring to networking. Start with this post to see what’s new and how customers like Spotify are using Kubernetes Engine on Google Cloud.

And one of our recent posts also struck a chord, as two of our site reliability engineering (SRE) experts delved into the differences—and similarities—between SRE and DevOps. They have similar goals, mostly around creating flexible, agile dev environments, but SRE generally gets much more specific and prescriptive than DevOps in accomplishing them.

Under the radar: GCP adds infrastructure options

As you look for new ways to use GCP to run your business, our engineers are adding features and new releases to give you more power, resources and coverage.

First, we introduced ultramem Google Compute Engine machine types, which offer more memory and compute resources than any other Compute Engine VM instance. These machine types are especially useful for those of you running enterprise workloads that need a lot of memory, like data analytics or high-performance applications.

We’ve also been busy on the back-end in other ways too, as we continue adding new regional cloud computing infrastructure. Our third zone of the Singapore region opened in May, and we’ll open a Zurich region next year.

Stay tuned in June for more on the technologies behind Google Cloud—we’ve got lots up our sleeve.

7 tips to maintain security controls in your GCP DR environment



Cloud computing has changed many traditional IT practices, and one particularly useful change has been in the area of disaster recovery (DR). Our team helps Google Cloud Platform (GCP) users build their infrastructures with cloud, and we’ve seen some great results when they use GCP as the DR target environment.

When you integrate a cloud provider like GCP into your DR plan, you no longer have to invest up front in mostly idle backup infrastructure. Testing that DR plan no longer seems so daunting, as you can bring up your DR environment automatically and close it all down again when it’s no longer needed—and it’s always ready for the next tests. In the event of an actual disaster, the DR environment can be made ready.

However, planning and testing for and recovering from a disaster involves more than just getting your application restored and available within your Recovery Time Objective (RTO). You need to ensure the security controls you implemented on-premises also apply to your recovered environment. This post provides tips to help you maintain your security controls in your cloud DR environment.

1. Grant users the same access they’re used to

If your production environment is running on GCP, it’s easy to replicate the identity and access management (IAM) policies already in place. Use infrastructure as code methods and employ tools such as Cloud Deployment Manager to recreate your IAM policies. You can then bind the policies to corresponding resources when you’re standing up the DR environment.

If your production environment is on-premises, map functional roles such as your network administrator and auditor roles to appropriate IAM policies. Use the IAM documentation, which has some example configurations such as the networking and audit logging functional roles. You’ll want to configure IAM policies to grant appropriate permissions to GCP products. For example, you might want to restrict access to specific Google Cloud Storage buckets.
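As a sketch of what such a binding can look like (the project ID, user, and role here are placeholders; pick the role that matches your functional mapping):

gcloud projects add-iam-policy-binding my-dr-project \
    --member=user:netadmin@example.com \
    --role=roles/compute.networkAdmin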

Once you’ve created a test environment, ensure that the access granted to your users confers the same permissions as those they are granted on-premises.

If your production environment runs on another cloud, you will need to understand how to map its IAM controls to Cloud IAM policies. The GCP for AWS professionals management doc can help if your other cloud is AWS.

2. Ensure user understanding

Do not wait for a disaster to occur before checking that your users (whether developers, operators, data scientists, or security or network admins) can access the DR environment. Make sure their accounts have been granted the appropriate access rights. If you are using an alternative identity system, ensure accounts have been synced with your Cloud Identity account. Make sure users who will need access to the DR environment (which will be the production environment for a while) are able to log in to it, and resolve any authentication issues. When you conduct regular DR tests, incorporate users logging into the DR environment as part of the test process.

Enable the GCP OS login feature on the projects that constitute your DR environment so you can centrally manage who has SSH access to VMs launched in the DR environment.
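For example, OS Login can be enabled for a whole project with one metadata entry; my-dr-project is a placeholder:

gcloud compute project-info add-metadata \
    --project my-dr-project \
    --metadata enable-oslogin=TRUE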

It’s also important to train users on the DR environment so that they understand how to undertake their usual actions in GCP. Use the test environment for this.

3. Double-check blocking and compliance requirements

It’s essential that your network controls confer the same separation and blocking settings in the DR as the source production environment. Learn how to configure Shared VPCs and GCP firewalls and take advantage of using service accounts as part of the firewall rules. Understand how to use service accounts to implement least privileges for applications accessing GCP APIs.
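As a sketch of a least-privilege rule (the network, service accounts, and port below are placeholders), you might allow only your application VMs to reach your database VMs:

gcloud compute firewall-rules create allow-app-to-db \
    --network dr-vpc \
    --allow tcp:3306 \
    --source-service-accounts app-sa@my-dr-project.iam.gserviceaccount.com \
    --target-service-accounts db-sa@my-dr-project.iam.gserviceaccount.com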

In addition, make sure your DR environment meets your compliance requirements. Validate that access to your DR environment is restricted to only those who need access. Ensure personally identifiable information (PII) data is appropriately redacted and encrypted. If you carry out regular penetration tests on your production environment, you should start including your DR environment and carry out those tests by regularly standing up a DR environment.

4. Capture log data

When the DR environment is in service, the logs collected during that time should be backfilled into your production environment log archive. Ensure that as part of your GCP DR environment you are able to export audit logs that are collected via Stackdriver to your main log sink archive. Use the export sink facilities. For application logs, stand up a mirror of your on-premises logging and monitoring environment. For another cloud, map across to the equivalent GCP services. Have a process in place to format this log input into your production environment.
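For example, here is a sketch of an export sink that sends the DR project's audit logs to a central Cloud Storage bucket (the sink and bucket names are placeholders):

gcloud logging sinks create dr-audit-sink \
    storage.googleapis.com/my-central-log-archive \
    --log-filter='logName:"cloudaudit.googleapis.com"'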

5. Use Cloud Storage for day-to-day backup routines

Use Cloud Storage to store backups that result from the DR environment. Ensure the storage buckets containing your backups have limited permissions applied to them.

The same security controls should apply to your recovered data as if it were production data: the same permissions, encryption, and audit requirements apply. Know where your backups are located, and who restored them.
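As a sketch, you might grant read access to your DR operators and write access only to the service account that produces backups (all principals and the bucket name are placeholders):

gsutil iam ch \
    group:dr-operators@example.com:objectViewer \
    serviceAccount:backup-sa@my-dr-project.iam.gserviceaccount.com:objectAdmin \
    gs://my-dr-backups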

6. Consider secrets and key management

Manage application-level secrets and keys using GCP to host the key or secret management service. You can use Cloud Key Management Service (KMS) or a third-party solution like HashiCorp Vault with a GCP backend such as Cloud Spanner or Cloud Storage.

7. Manage VM images and snapshots

If you have predefined configurations for VMs, such as who can use them or make changes, reflect those controls in your GCP DR recovery site. Follow the guidance outlined in restricting access to images.

These tips can help make sure you don’t lose the security you’ve built into your production environment when you stand up a DR site. You’ll be on your way more quickly to having a cloud DR site that works for your users and the business.

Next steps:

Read our guide on How to Design a DR plan.

Related content:

Cloud Identity-Aware Proxy: a simple and more secure way to manage application access
Know thy enemy: how to prioritize and communicate risks - CRE life lessons
Building trust through Access Transparency

Kubernetes best practices: upgrading your clusters with zero downtime



Editor’s note: Today is the final installment in a seven-part video and blog series from Google Developer Advocate Sandeep Dinesh on how to get the most out of your Kubernetes environment.

Everyone knows it’s a good practice to keep your application up to date to optimize security and performance. Kubernetes and Docker can make performing these updates much easier, as you can build a new container with the updates and deploy it with relative ease.

Just like your applications, Kubernetes is constantly getting new features and security updates, so the underlying nodes and Kubernetes infrastructure need to be kept up to date as well.

In this episode of Kubernetes Best Practices, let’s take a look at how Google Kubernetes Engine can make upgrading your Kubernetes cluster painless!

The two parts of a cluster

When it comes to upgrading your cluster, there are two parts that both need to be updated: the masters and the nodes. The masters need to be updated first, and then the nodes can follow. Let’s see how to upgrade both using Kubernetes Engine.

Upgrading the master with zero downtime
Kubernetes Engine automatically upgrades the master as point releases are released; however, it usually won't upgrade to a new minor version (for example, 1.7 to 1.8) automatically. When you are ready to upgrade to a new version, you can just click the upgrade master button in the Kubernetes Engine console.

However, you may have noticed that the dialog box says the following:

“Changing the master version can result in several minutes of control plane downtime. During that period you will be unable to edit this cluster.”

When the master goes down for the upgrade, deployments, services, etc. continue to work as expected. However, anything that requires the Kubernetes API stops working. This means kubectl stops working, applications that use the Kubernetes API to get information about the cluster stop working, and basically you can’t make any changes to the cluster while it is being upgraded.

So how do you update the master without incurring downtime?


Highly available masters with Kubernetes Engine regional clusters

While the standard “zonal” Kubernetes Engine clusters only have one master node backing them, you can create “regional” clusters that provide multi-zone, highly available masters.

When creating your cluster, be sure to select the “regional” option:

And that’s it! Kubernetes Engine automatically creates your nodes and masters in three zones, with the masters behind a load-balanced IP address, so the Kubernetes API will continue to work during an upgrade.

Upgrading nodes with zero downtime

When upgrading nodes, there are a few different strategies you can use. There are two I want to focus on:
  1. Rolling update
  2. Migration with node pools
Rolling update
The simplest way to update your Kubernetes nodes is to use a rolling update. This is the default upgrade mechanism Kubernetes Engine uses to update your nodes.

A rolling update works in the following way. One by one, each node is cordoned and drained so that no pods remain running on it. The node is then deleted, and a new node is created with the updated Kubernetes version. Once that node is up and running, the next node is updated. This continues until all nodes are updated.

You can let Kubernetes Engine manage this process for you completely by enabling automatic node upgrades on the node pool.
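If you prefer the command line, you can also enable automatic upgrades on an existing node pool; the cluster, pool, and zone names below are placeholders:

gcloud container node-pools update default-pool \
    --cluster my-cluster --zone us-central1-a \
    --enable-autoupgrade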

If you don’t select this, the Kubernetes Engine dashboard alerts you when an upgrade is available:

Just click the link and follow the prompt to begin the rolling update.

Warning: Make sure your pods are managed by a ReplicaSet, Deployment, StatefulSet, or something similar. Standalone pods won’t be rescheduled!

While it’s simple to perform a rolling update on Kubernetes Engine, it has a few drawbacks.

One drawback is that you get one less node of capacity in your cluster. This issue is easily solved by scaling up your node pool to add extra capacity, and then scaling it back down once the upgrade is finished.
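For example (a sketch with placeholder names; older gcloud releases use --size instead of --num-nodes):

gcloud container clusters resize my-cluster \
    --node-pool default-pool --num-nodes 4
# ...and once the upgrade is finished...
gcloud container clusters resize my-cluster \
    --node-pool default-pool --num-nodes 3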

The fully automated nature of the rolling update makes it easy to do, but you have less control over the process. It also takes time to roll back to the old version if there is a problem, as you have to stop the rolling update and then undo it.

Migration with node pools
Instead of upgrading the “active” node pool as you would with a rolling update, you can create a fresh node pool, wait for all the nodes to be running, and then migrate workloads over one node at a time.

Let’s assume that our Kubernetes cluster has three VMs right now. You can see the nodes with the following command:
kubectl get nodes
NAME                                        STATUS  AGE
gke-cluster-1-default-pool-7d6b79ce-0s6z    Ready   3h
gke-cluster-1-default-pool-7d6b79ce-9kkm    Ready   3h
gke-cluster-1-default-pool-7d6b79ce-j6ch    Ready   3h


Creating the new node pool
To create the new node pool with the name “pool-two”, run the following command:
gcloud container node-pools create pool-two

Note: Remember to customize this command so that the new node pool is the same as the old pool. You can also use the GUI to create a new node pool if you want.
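For instance, a fuller version of the command might pin the zone, machine type, and node count to match your old pool; every value below is a placeholder you should replace with your existing pool's settings:

gcloud container node-pools create pool-two \
    --cluster my-cluster --zone us-central1-a \
    --machine-type n1-standard-2 --num-nodes 3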

Now if you check the nodes, you will notice there are three more nodes with the new pool name:
$ kubectl get nodes
NAME                                        STATUS  AGE
gke-cluster-1-pool-two-9ca78aa9-5gmk        Ready   1m
gke-cluster-1-pool-two-9ca78aa9-5w6w        Ready   1m
gke-cluster-1-pool-two-9ca78aa9-v88c        Ready   1m
gke-cluster-1-default-pool-7d6b79ce-0s6z    Ready   3h
gke-cluster-1-default-pool-7d6b79ce-9kkm    Ready   3h
gke-cluster-1-default-pool-7d6b79ce-j6ch    Ready   3h

However, the pods are still on the old nodes! Let’s move them over.

Drain the old pool
Now we need to move work to the new node pool. Let’s move over one node at a time in a rolling fashion.

First, cordon each of the old nodes. This will prevent new pods from being scheduled onto them.

kubectl cordon <node_name>
Once all the old nodes are cordoned, pods can only be scheduled on the new nodes. This means you can start to remove pods from the old nodes, and Kubernetes automatically schedules them on the new nodes.
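If you'd rather cordon the whole old pool in one go, you can loop over its nodes by label. This sketch assumes the standard cloud.google.com/gke-nodepool label that Kubernetes Engine sets on its nodes:

for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool \
    -o jsonpath='{.items[*].metadata.name}'); do
  kubectl cordon "$node"
done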

Warning: Make sure your pods are managed by a ReplicaSet, Deployment, StatefulSet, or something similar. Standalone pods won’t be rescheduled!

Run the following command to drain each node. This deletes all the pods on that node.

kubectl drain <node_name> --force


After you drain a node, make sure the new pods are up and running before moving on to the next one.
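A quick way to check is to list the pods along with the nodes they're scheduled on:

kubectl get pods -o wide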

If you have any issues during the migration, uncordon the old pool and then cordon and drain the new pool. The pods get rescheduled back to the old pool.

Delete the old pool
Once all the pods are safely rescheduled, it is time to delete the old pool.

Replace “default-pool” with the pool you want to delete.

gcloud container node-pools delete default-pool


You have just successfully updated all your nodes!

Conclusion


By using Kubernetes Engine, you can keep your Kubernetes cluster up to date with just a few clicks.

If you are not using a managed service like Kubernetes Engine, you can still use the rolling update or node pool method with your own cluster to upgrade nodes. The difference is that you need to manually add the new nodes to your cluster and perform the master upgrade yourself, which can be tricky.

I highly recommend using Kubernetes Engine regional clusters for high-availability masters, along with automatic node upgrades, for a hassle-free upgrade experience. If you need extra control over your node updates, using node pools gives you that control without giving up the advantages of the managed Kubernetes platform that Kubernetes Engine provides.

And thus concludes this series on Kubernetes best practices. If you have ideas for other topics you’d like me to address in the future, you can find me on Twitter. And if you’re attending Google Cloud Next ‘18 this July, be sure to drop by and say hi!

Troubleshooting tips: How to talk so your cloud provider will listen (and understand)



Editor’s note: We’re excited to bring you this blog post from the team of Google experts who wrote the book (really!) on Site Reliability Engineering (SRE) a few years back. The second edition of the book is underway, and as a teaser, this post delves into one area of SRE that’s relevant to many IT teams today: troubleshooting in the age of cloud computing. This is part one of two. Then, check out the second installment specifically on troubleshooting cloud provider communications.

Effective technology troubleshooting requires a systematic approach, as opposed to luck or experience. Troubleshooting can be a learned skill, as discussed in our site reliability engineering (SRE) troubleshooting primer.

But how does that change when you and your operations team are running systems and services on cloud infrastructure? Regardless of where your websites or apps live, you’re the one getting paged when the site goes down and you are the one under pressure to solve the problems and answer questions.

Cloud presents a new way of working for IT teams shifting away from legacy systems. You had full visibility and control of all aspects of your system when it was on-premises, but now you depend on off-site, cloud-based infrastructure, into which you may have limited visibility or understanding. That’s true no matter how top-notch your systems and processes are and how great a troubleshooter you are. You simply can’t see much into many cloud provider systems. The troubleshooting challenge is no longer limited to pure debugging; you also need to effectively communicate with your cloud provider. You’ll need to engage each provider’s support process and engineers to find where problems originated and fix them as soon as possible. This is an area of opportunity for IT teams as they gain new skills and adapt to the cloud-based technology model.

It’s definitely possible to do cloud troubleshooting right. When you’re choosing a provider, make sure to understand how they’ll work with you during any issues. Once you’re working with a provider, you have more control than you think over how you communicate issues and speed along their resolution. We propose this model, inspired by the SRE model of actionable improvements, for working with your cloud provider to troubleshoot more effectively and efficiently. (For more on cloud provider communications throughout the troubleshooting process, see the companion post.)

Understand your cloud provider's support workflow

Your goal here is to figure out the best way to provide the right information to your provider when an issue inevitably arises. This is a useful place to start, especially since you may depend on multiple cloud providers to run your infrastructure. Your interaction with cloud support will typically begin with questions related to migration. Your relationship then progresses into the domain of production integration, and finally, joint troubleshooting of production problems.

Keep in mind that different cloud providers have different philosophies when it comes to customer interaction. Some provide a large degree of freedom and little direct support. They expect you to find answers to most of your questions from online forums such as Stack Overflow. Other providers emphasize tight customer integration and joint troubleshooting. You have some homework to do before you begin serving real customer traffic from your cloud deployment. Talk to your cloud provider's salespeople, project managers and support engineers, to get a sense of how they approach support. Ask them the following questions:
  • What does the lifecycle of a typical support issue report look like?
  • What is your internal escalation process if an issue becomes complex or critical?
  • Do you have an internal SLO for <service name>? If so, what is that SLO?
  • What types of premium support are available?
This step is critical in reducing your frustration when you have to troubleshoot an issue that involves a cloud provider’s giant black box.

Communicate with cloud provider support efficiently

Once you have a sense of the provider’s support workflow, you can figure out the best way to get your information across. There are some best practices for filing the perfect issue report with cloud provider support teams, including what to say in your issue report and why. These guidelines follow the 80/20 rule: we try to give 20% of the details that will be useful in 80% of your issue reports. The same principles apply if you're filing bug reports in issue trackers or posting to user groups and forums.

Your guiding principle when communicating to cloud providers should be clarity: specify the appropriate level of technical detail and communicate expectations explicitly.

Provide basic information
It may seem like common sense, but these basics are essential to include in an issue report. Failing to provide any of these details leads to delays and a poor experience.

Include four critical details
Effective troubleshooting starts with time, product, location and specific identifiers.

1. Time
Here are a few examples of including time in an issue report:
  • Starting at 2017-09-08 15:13 PDT and ending 5 minutes later, we observed...
  • Observed intermittently, starting no earlier than 2017-09-10 and observed 2-5 times...
  • Ongoing since 2017-09-08 15:13 PDT...
  • From 2017-09-08 15:13 PDT to 2017-09-08 22:22 PDT...
Including the onset time and duration allows support teams to focus their time-series monitoring on the relevant period. Be explicit about whether the issue is ongoing or whether it was observed only in the past. If the issue is not ongoing, be explicit about that fact and provide an end time, if possible.

Remember to always include the time zone. ISO 8601 format is a good choice because it is unambiguous and easy to sort. If you instead specify time in a relative way, whomever you're working with must convert local time into an absolute format they can input into time-series monitoring tools. This conversion is error-prone and costly: for example, sending an email about something that happened "earlier yesterday" means that your counterpart has to search for an email header and start doing mental math. This introduces cognitive load, which decreases the person’s mental energy available for solving the technical problem.

If an issue was intermittent over some period of time, state when it was first observed, or perhaps note a time in the past when it was surely not happening. Include the frequency of observations and note one or two specific examples.

Meanwhile, here are some antipatterns to avoid:
  • Earlier today: Not specific enough
  • Yesterday: Requires the recipient to figure out the implied date; can be confusing especially when work crosses the International Date Line
  • 9/8: Ambiguous, as the date might be interpreted as September 8 in the United States or August 9 in other locales. Use ISO 8601 format for clarity.
2. Product
Be as specific as possible in your issue report about the product you're using, including version information where applicable. These, for example, aren’t specific enough to locate the components or logs that can help with diagnosis:
  • REST API returned errors...
  • The data mining query interface is hanging...
Ideally, you should refer to specific APIs or URLs, or include screenshots. If the issue originates in a specific component or tool (for example, the CLI or Terraform), clearly identify that tool. If multiple products are involved, be specific about each one. Describe the behavior you're observing, and the behavior you expected to occur.

Antipatterns:
  • Can't create virtual machine: It's not clear how you're attempting to create the machine, nor does it say what the failure mode is.
  • The CLI command is giving an error:
    • Instead, provide the specific error, and the command syntax so others can run the command themselves.
    • Better: I ran 'mktool create my-instance --zone us-central1' and got the following error message...
3. Location
It's important to specify the region and zone because cloud providers often roll out changes to one region or zone at a time. Therefore, region or zone is a proxy for a cloud-internal software version number.
These are examples of how you might include location information in an issue report:
  • In us-east1-a... 
  • I tried regions eu-west-1 and eu-west-3...
Given this information, the support team can see if there's a rollout underway in a given location, or map your issue to an internal release ID for use in an internal bug report.

4. Specific identifiers
Project identifiers are included in many troubleshooting tools. Specify whether you observed the error in multiple projects, or in one project but not another.

These are examples of specific identifiers:
  • In project 123412341234 or my-project-id... 
  • Across multiple projects (including 123412341234)... 
  • Connecting to cloud external IP 218.239.8.9 from our corporate gateway 56.56.56.56... 
IP addresses are another form of unambiguous identifiers, though they also require additional details when used in an issue report. When specifying an IP, try to describe the context of how it's used. For example, specify whether the IP is connected to a VM instance, a load balancer or a custom route, or if it's an API endpoint. If the IP address isn't part of the cloud platform (for example, your home internet, a VPN endpoint or an external monitoring system), specify that information. Remember, 192.168.0.1 occurs many times in the world.

Antipatterns:
  • One of our instances is unreachable…: Overly vague 
  • We can't connect from the Internet...: Overly vague 
Note that other models, such as the Five Ws (popular in fields like journalism) can provide structure to your report.

Specify impact and response expectations
Basic information to include in your issue report to a cloud provider should also include how it’s affecting your business and when it needs to be resolved.

Priority expectations
Cloud provider support commonly uses the priority you specify to initially route the issue report and to determine the urgency of the issue. The priority rating drives the speed of response, potentially paging oncall personnel.

In addition to selecting the appropriate priority, it's useful to add a sentence describing the impact. Help avoid incorrect assumptions by being explicit about why you selected P1.

Think of the priority in terms of the impact to your business. Your cloud provider may have what appear to be strict definitions of priority (e.g., P1 signifies a total outage). Don't let these definitions slow progress on issues that are business critical; extrapolate the impact if your issue isn't addressed, or describe the worst-case scenario of an exposure related to the issue. For example, the following two descriptions essentially describe impending P1 issues:
  • Our backlog is increasing. Current impact is minor, but if this issue is not fixed in 12 hrs, we're effectively down. 
  • A key monitoring component has failed. While there is no current impact, this means we have a blind spot that will cause the next failure to become a user visible outage. 
Response time expectations
If you have specific needs related to response time, indicate them clearly. For example, you might specify "I need a response by 5pm because that's when my shift ends." If you have internal outage communication SLOs, make sure you request a response time from your provider that is within that interval so you can meet those SLOs. Cloud providers likely have 24/7 support, but if this isn't the case, or if your relevant personnel are in a particular time zone, communicate your time zone-specific needs to your provider.

Including these details in your issue report will ideally save you time later and speed up your overall provider resolution process. Check out part two for tips on communicating with your cloud provider throughout the actual troubleshooting process.

Related content:
SRE vs. DevOps
Incident management at Google —adventures in SRE-land
Applying the Escalation Policy — CRE life lessons
Special thanks to Ralph Pearson, J.C. van Winkel, John Lowry, Dermot Duffy and Dave Rensin

Troubleshooting tips: Help your cloud provider help you



Editor’s note: We’re excited to bring you this blog post from the team of Google experts who wrote the book (really!) on Site Reliability Engineering (SRE) a few years back. The second edition of the book is underway, and this post delves into one area of SRE that’s relevant to many IT teams today: troubleshooting in the age of cloud computing. This is part two of two. Check out part one on writing better issue reports for cloud provider support.

Troubleshooting computer systems is an act as old as computers themselves. Some might even call it an art. The cloud computing paradigm entails a fundamental change to how IT teams conduct troubleshooting.

Successful IT troubleshooting doesn’t depend only on luck or experience, but is a deliberate process that can be taught. When you’re using cloud-based infrastructure, you’re often troubleshooting via a cloud provider’s help desk, adding another layer to helping users. Because of this shift away from the traditional IT team model, your communications with the provider are essential. (See part one for more on putting together an effective issue report to improve troubleshooting from the start.)

Once you’ve communicated the issue to your provider, you’ll be working with the provider’s support team to get the issue fixed.

The essentials of cloud troubleshooting

Those diagnosing a technical problem with cloud infrastructure are seeking possible explanations (hypotheses) and evidence that explains the problem. In the short term, they look for changes in the system that roughly correlate with the problem, and consider rolling back, as a first step to mitigate the problem and stop the bleeding. The longer-term goal is to identify and fix the root cause so the problem will not recur.

From the site reliability engineering (SRE) perspective,  the general approach for troubleshooting is as follows:

  • Triage: Mitigate the impact if possible
  • Examine: Gather observations and share them
  • Diagnose: Create a hypothesis that explains the observations
  • Test and treat:
    • Identify tests that may prove or disprove the hypothesis
    • Execute the tests and agree on the meaning of the result
    • Move on to the next hypothesis; repeat until solved


When you’re working with a cloud provider on troubleshooting an issue, there are parts of the process you’re unable to control. But you can follow the steps on your end. Here’s what you can do when submitting a report to your cloud provider support team.

1. Communicate any troubleshooting you've already done
By the time you open an issue report, you've probably done some troubleshooting already. You may have checked the provider’s status page, for example. Share the steps you've taken and any key findings. Keep a timeline and log book of what you have done and share it with the provider. This means that you should start keeping a log book as soon as possible, from the start of detection of your problem. Keep in mind that while cloud providers may have telemetry that provides real-time omniscient awareness of the state of their infrastructure, the dependencies that result from your particular implementation may be less obvious. By design, your particular use of cloud resources is proprietary and private, so your troubleshooting vantage point is vital.

If you think you have a diagnosis, explain how you came to that conclusion. If you think others can reproduce the issue, include the steps to do so. A reproducible test in an issue report usually leads to the fastest resolution.

You may have an idea or guess about what's causing the problem. Be careful to avoid confirmation bias—looking for evidence to support your guess without considering evidence to the contrary.

2. Be specific and explicit about the issue
If you've ever played the telephone game, in which players whisper a message from person to person, you've seen how human translation and interpretation can lead to communication gaps. Rather than describing information in your provider communications, try to share it. Doing so reduces the chance that your reader will misinterpret what you're saying and can help speed up troubleshooting. Don’t assume that your provider has access to all of this information; customer privacy means that they may not, by design.

For example:

  • Use a screenshot to show exactly what you see
  • For web-based interfaces, provide a .HAR (Http ARchive) file
  • Attach information like tcpdump output, logs snippets and example stack traces

3. Report production outages quickly
An issue is considered to be a production outage if your application has stopped serving traffic to users or is experiencing similar business-critical impact. Report production outages to your cloud provider support as soon as possible. Issues that block a small number of developers in a developer test environment are normally not considered production outages, so they should be reported at lower priorities.

Normally, when cloud provider support is alerted about a production outage, they quickly triage the situation with the following steps:

  1. Immediately check for known issues affecting the infrastructure.
  2. Confirm the nature of the issue.
  3. Establish communication channels.


Typically, you can expect a quick response with a brief message, which might contain:

  • Whether or not there is a known issue affecting multiple customers
  • An acknowledgement that they can observe the issue you've reported or a request for more details
  • How they intend to communicate (for example, phone, Skype, or issue report)


It's important to quickly create an issue report including the four critical details described in part one, and then begin deeper troubleshooting on your side of the equation. If your organization has a defined incident management process (see Managing Incidents), escalating to your cloud provider should be among your initial steps.

4. Report networking issues with specificity
Most cloud providers’ networks are huge and complex, composed of many technologies and teams. It's important to quickly identify a networking-specific problem as such and engage with the team that can repair it.

Many networking issues have similar symptoms, like "can't connect to server," at a high level. This level of detail is typically too generic to be useful in identifying the root cause, so you need to provide more diagnostic information. Network issues relate to connectivity, which always involves at least two specific points: source and destination. Always include information about these points when reporting network issues.

To structure your issue report,  use the conceptual tool of a packet flow diagram:

  • Describe the important hops that a packet takes along a path from source to destination, along with any significant transformations (e.g., NAT) along the way.
  • Start by identifying the affected network endpoints by Internet IP address or by RFC 1918 private address, plus an ASN for the network.
  • Note anything meaningful about the endpoints, such as who controls them and whether they are associated with a DNS hostname. 
  • Note any intermediate encapsulation and/or indirection. For example: VPN tunneling, proxies or NAT gateways.
  • Note any intermediate filtering, like firewalls, CDN or WAF.


Many problems that manifest as high latency or intermittent packet loss will require a path analysis and/or a packet capture for diagnosis. Path analysis is a list of all hops that packets traverse (for example, MTR or tcptraceroute). A packet capture (a.k.a. pcap, derived from the name of the library libpcap) is an observation of real network traffic. It's important to take a packet capture for both endpoints, at the same time, which can be tricky. Practice with the necessary tools (for example tcpdump or Wireshark) and make sure they are installed before you need them.
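As a sketch of the kind of data to collect (the hostnames, addresses, and file names below are placeholders):

# Path analysis from the client toward the affected endpoint
mtr --report --report-cycles 100 api.example.com

# Packet capture on an endpoint, filtered to the peer's address
sudo tcpdump -i any host 203.0.113.10 -w client-side.pcap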

5. Escalate when appropriate
If circumstances change, you may need to escalate the urgency of an issue so it receives attention quickly. Take this step if business impact increases, if an issue is stuck without progress after a lot of back-and-forth with support, or if some other factor calls for quicker resolution.

The most explicit way to escalate an issue is to change the priority of the issue report (for example, from P3 to P2). Provide comments about why you need to escalate so support can respond appropriately.

6. Create a summary document for long-running or difficult issues
Issue state and relevant information change over time as new facts come to light and hypotheses are ruled out. In the meantime, new people join the investigation. Help communicate relevant, up-to-date information by collecting information in a summary document.

A good summary document has the following dimensions:

  • The latest state summarized at the top
  • Links to all relevant issue reports and internal tracking bugs
  • A list of hypotheses which are potentially true, and hypotheses that have been ruled out already. When you start investigating a particular hypothesis, note that you are doing so, and mention the tests or tools that you intend to use. Often, you can get good advice or prevent duplicate work.


SAMPLE summary document format:

$TIMESTAMP
<Current customer impact> <Working theory and actions being taken> <Next steps>

13:00:00
Customer impact has been mitigated and resolved. Our networking provider was throttling our traffic because we forgot to pay our bill last month. Next step is to be nicer to our finance team.

12:00:00
More than 100 customers are actively complaining about not being able to reach our service. Our networking provider is throttling customer traffic to one of our load balancers. The response team is actively working with our networking provider’s tier 1 support to understand why and how this happened.

11:00:00
We have now received 100 complaints from 50 customers from four different geos that they cannot consistently reach our API at api.acme.com. Our engineers currently believe that an upstream networking issue is causing this. Next steps are to reach out to our networking provider to see if there are any upstream issues.

10:00:00
We have received five complaints from five customers that they are unable to reach api.acme.com. Our engineers are looking into the issue.


Try to keep each issue report focused on a single issue. Don't reopen an issue report to bring up a new issue, even if it's related to the original issue. Do reference similar issues in your new report to help your provider recognize patterns from systemic root causes.

Keep your communication skills sharp

Communicating highly detailed technical information in a clear and actionable manner can be difficult. Doing so requires focus and specific skills. This task is particularly challenging in stressful situations, because our biological response to stress works against the need for clear cognitive reasoning. The following techniques help make communication easier for everyone.

Help reduce cognitive load by writing a detailed issue report
Many issue reports require the reader to make inferences or calculations. This introduces cognitive load, which decreases the mental energy available for solving the technical problem.

When writing an issue report, be as specific and detailed as possible. While this attention to detail requires more time on the part of the writer, consider that an issue report is written once but read many times by many people. People can solve the problem faster together when equipped with comprehensive information. Avoid acronyms and internal company code names. Also, be mindful of protecting customer privacy when disclosing any information to any third party.

Use narrative techniques
Once upon a time, in a land far, far away...

Humans are very good at absorbing information in the form of stories, so you can get your point across quite effectively this way. Start with the context: What was happening when you first observed the problem? What potential fixes did you try? Who are the characters involved, and why does the issue matter to them?
Include visuals
Illustrate your issue report with any supporting material you have available, such as formatted text, charts, and screenshots.

Text formatting
Formatted text like log lines, code excerpts or MySQL results often becomes illegible when sent through plain-text emails. Add explicit markers (for example, <<<<<< at the end of the line) to help direct attention to important sections. You can use footnotes to point to long-form URLs, or use a URL shortener.

Use bullet points to format lists, and to call out important details like instance names. Use numbered lists to enumerate series of steps.

Charts
Charts are very useful for understanding time-series data. When you’re sending charts with an issue report, keep these best practices in mind:

  • Take a screenshot, including title and axis labels. For absolute values, specify units (requests per minute, errors per second, etc).
  • Annotate the screenshot with arrows or circles to call out important points.
  • Briefly describe what the chart is measuring.
  • Briefly describe how the chart normally looks.
  • In a few sentences, describe your interpretation of the chart and why it is relevant to the problem.


Avoid the following antipatterns:

  • The Y-axis represents a specific error (e.g., exceptions in my-handler) and has no clear relationship to the problem under investigation (e.g., high persistence-layer latency). To remedy this situation, explain why the graph is relevant to the problem.
  • The Y-axis is an absolute number (e.g., 1M per minute) that provides no context about the relative impact.
  • The X-axis doesn't have a time zone.
  • The Y-axis is not zero-based. This can make minor changes in the Y value seem very large.
  • Axis labels are missing or cut off.

Well-crafted issue reports, along with strong communication with your cloud provider, can speed up the resolution process and reduce the time it takes. The cloud computing model has drastically changed the way IT teams troubleshoot computer systems. Technical savvy is no longer the sole skill set needed for effective troubleshooting; you must also be able to communicate clearly and efficiently with cloud providers. While the reality of your deployment may be unique, nuanced, and complex, these building blocks can help you navigate this territory.

Related content:
SLOs, SLIs, SLAs, oh my - CRE life lessons
SRE vs. DevOps: competing standards or close friends?
Introducing Google Customer Reliability Engineering


Special thanks to Ralph Pearson, J.C. van Winkel, John Lowry, Dermot Duffy and Dave Rensin

Cloud Source Repositories: more than just a private Git repository



If your goal is to release software continuously at high velocity, you need to be able to automatically build, test, deploy, and debug your code changes, all within minutes. But first you need to integrate your version control systems and your build, deploy, and debugging tools—a time-consuming and complicated process that requires numerous manual configuration steps like downloading plugins and setting up webhooks. And when you’re done, the workflow still isn’t very integrated, forcing developers to jump from one tool to another as they go from code to deployment. So much for high velocity.

Cloud Source Repositories, a set of fully managed private Git repositories hosted on Google Cloud Platform (GCP), is tightly integrated with other GCP tools, making it easy to automatically build, test, deploy, and debug code right out of the gate. With just a few clicks and without any additional setup or configuration, you can extend Cloud Source Repositories with other GCP tools to perform other tasks as part of your development workflow. In this post, let's take a closer look at some of the GCP tools that are integrated with Cloud Source Repositories, and how they simplify developer workflows:

Simplified continuous integration (CI) with Container Builder

Looking to implement continuous integration and validate each check-in to a shared repository with an automated build and test? The integration of Cloud Source Repositories with Container Builder comes in handy here, making it easy to set up CI on a branch or tag. There are no CI servers to set up or repositories to configure. In fact, you can enable a CI process on any existing or new repo in Cloud Source Repositories. Simply specify the trigger on which Container Builder should build the image. In the following example, for instance, the trigger specifies that a build will run whenever changes are pushed to any branch of the Cloud Source repository.


To demonstrate this trigger in action, the example below changes the background color of the “Hello World” website from yellow to blue.

The first step involves setting blue as the background color using the background-color CSS property. Then, you add the changed file to the index using a git add command and record the change to the repository with git commit. Finally, the commits are pushed to the remote server using git push.
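In command form, that sequence might look like the following sketch; the file name, commit message, and remote name are placeholders for whatever your repository uses:

git add index.html
git commit -m "Change background color from yellow to blue"
git push origin master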

Because of the trigger defined above, an automated build is triggered as soon as changes are pushed to Cloud Source Repositories. Container Builder starts automatically building the image based on the changes. Once the image is created, the new version of the app is deployed using kubectl set image. The new changes are reflected and the “Hello World” website now shows a blue background color.
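The deployment step mentioned above is a standard kubectl set image call; the deployment, container, and image names in this sketch are placeholders rather than the exact ones from the demo:

kubectl set image deployment/hello-world hello-world=gcr.io/my-project/hello-world:v2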

Follow this quickstart to begin continuous integration with Container Builder & Cloud Source Repositories.

Pre-Installed tools and programming languages in Cloud Shell and Cloud Shell Editor

Cloud Source Repositories is integrated out of the box with Cloud Shell and the Cloud Shell Editor. Cloud Shell provides browser-based command-line access, giving you an easy way to build and deploy applications. It comes preconfigured with common tools such as the MySQL client, Kubernetes tooling, and Docker, as well as Java, Go, Python, Node.js, PHP and Ruby, so you don't have to spend time looking for the latest dependencies or installing software. The Cloud Shell Editor, meanwhile, acts as a cross-platform IDE for editing code with no setup.

Quick deployment to App Engine

The integration of Cloud Source Repositories and App Engine makes publishing applications a breeze. It gives you a way to deploy apps quickly and lets developers focus on writing code, without worrying about managing the underlying infrastructure or scaling the app as its needs grow. You can deploy source code stored in Cloud Source Repositories to App Engine with the gcloud app deploy command, which automatically builds an image and deploys it to the App Engine flexible environment. Let's see this in action.

In the following example, we’ll change the text on the website from “Hello Universe” to “Hello World” before deploying it. Like with the previous example, git add and git commit help stage and commit the files staged to Cloud Source Repositories. Next, the git push command pushes the changes to the master branch.

Once the changes have been pushed to Cloud Source Repositories, you can deploy the new version of the application by running the gcloud app deploy command from the directory where the app.yaml file is located.
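For example (a sketch; the project ID is a placeholder, and app.yaml must already describe your app's runtime):

gcloud app deploy app.yaml --project my-project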

The text is now changed to “Hello, World!” from “Hello Universe”.

Try deploying code stored in Cloud Source Repositories to App Engine by following the quickstart here.

Debug in production with Stackdriver Debugger

If your app is running in production and has problems, you need to troubleshoot issues quickly to avoid bad customer experiences. For debugging apps in production, creating breakpoints isn't really an option as you can’t suspend the program. To help locate the root cause of production issues quickly, Cloud Source Repositories is integrated with Stackdriver Debugger, which lets you debug applications in production without stopping or slowing the application.

Stackdriver Debugger lets you use either a debug snapshot or a debug logpoint to debug production applications. A debug snapshot captures the call stack and local variables at a specific code location the first time any instance of that code is executed. A debug logpoint, on the other hand, writes log messages to the log stream. You can set a debug snapshot or a debug logpoint for code stored in Cloud Source Repositories with a single click.

Debug Snapshot for debugging

In the following example, a snapshot has been set up for the second line of code in the get function of the MainPage class.

The right-hand panel displays details such as the call stack and the values of local variables in scope once the snapshot location set above is reached.

Learn more about production debugging by following the quickstart here.

Debug Logpoint for Debugging

The integration of Stackdriver with Cloud Source Repositories also lets you inject logging statements without restarting the app, and you can store, search, analyze, monitor, and alert on that log data and the resulting events. As an example, a logging statement introduced into the code above is highlighted below.

The logs panel highlights the log entries written by the logpoint.

Version control with Cloud Functions

If you're building a serverless app, you'll be happy to know that Cloud Source Repositories is also integrated with Cloud Functions. You can store your function source code in Cloud Source Repositories and reference it from event-driven serverless apps. Functions whose source lives in Cloud Source Repositories can be deployed and invoked in response to a range of triggers, including HTTP requests and Cloud Pub/Sub messages. Changes made to function source code are automatically tracked over time, and you can roll back to a previous state of any repository.

In the following example, the “helloworld” function is deployed with an HTTP trigger. The source code for the function is located in the root directory of the Cloud Source repository.
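A hedged sketch of such a deployment with gcloud is shown below; the project ID (my-project), repository name (default), and runtime are placeholder assumptions:

$ gcloud functions deploy helloworld \
    --source https://source.developers.google.com/projects/my-project/repos/default \
    --trigger-http \
    --runtime nodejs8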

Learn more about deploying your function source code stored in Cloud Source Repositories using the quickstart here.

In short, the integration of Cloud Source Repositories with other Google Cloud tools lets your team go from code to deployment in minutes, all while managing versioning and aliasing. You can even debug in production on the fly using the built-in monitoring and logging tools. Try Cloud Source Repositories along with these integrations here.

Google is named a leader in the 2018 Gartner Infrastructure as a Service Magic Quadrant



We’re pleased to announce that Gartner recently named Google as a Leader in the 2018 Gartner Infrastructure as a Service Magic Quadrant (report available here).

With an increasing number of enterprises turning to the cloud to build and scale their businesses, research from organizations like Gartner can help you evaluate and compare cloud providers.

We believe being recognized by Gartner as one of the three leading cloud providers demonstrates our commitment to building innovative technology that helps customers run their businesses at scale. It also highlights our goal to help customers transform their businesses through open source and deep investments in analytics and machine learning.

Here are a few takeaways from the report:

A solid compute foundation
Gartner identifies our core IaaS and PaaS capabilities as a strength, and notes that we're increasingly offering a number of innovative capabilities. From custom machine types and sustained use discounts, to the next generation of cloud-native containerized development and operations through tools like Kubernetes and Istio, we work hard to deliver a cloud that can run your most demanding applications.

Our investments in analytics and ML
The report recognizes the investments we've made in advanced analytics and machine learning, and our Google Cloud AI team has been making good progress on this front. In 2017, we introduced Cloud Machine Learning Engine to help developers with machine learning expertise easily build ML models that work on any type of data, of any size. We showed how modern machine learning services, that is, APIs such as Vision, Speech, NLP, Translation, and Dialogflow, could be built upon pre-trained models to bring scale and speed to business applications. Kaggle, our community of data scientists and ML researchers, has grown to more than one million members. Today, more than 10,000 businesses are using Google Cloud AI services, including companies like Kewpie and Ocado. And we recently introduced Cloud AutoML to help businesses with limited ML expertise start building their own high-quality custom models.

Our commitment to openness
Our strong grounding in the open source ecosystem, with an emphasis on portability, was highlighted in the report. Our goal is to help more organizations take advantage of cloud services, which means offering the tools to build, scale, and quickly move to the cloud. Our dedication to portability and open source gives you the flexibility to build on your own terms.

Sharing our best practices with customers
Our Customer Reliability Engineering (CRE) program is an approach that can help customers succeed while running their operations on Google Cloud Platform (GCP). We built CRE to provide a shared operational fate between you and Google, giving you more control over the critical applications you've entrusted to us.

Google Cloud continues to be adopted by enterprises who are looking to achieve greater availability, scalability, and security in the cloud. Gartner’s IaaS Magic Quadrant is now the sixth report from a leading analyst firm that has identified Google Cloud as a Leader. You can download a complimentary copy of the Gartner Cloud Infrastructure as a Service Magic Quadrant report on our website.

Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

Gain visibility and take control of Stackdriver costs with new metrics and tools



A few months back, we announced new simplified Stackdriver pricing that will go into effect on June 30. We're excited to bring this change to our users: you get advanced notifications and alerting on the performance and diagnostics data you track for your cloud applications, plus flexibility in creating dashboards, without having to opt in to a premium pricing tier.

We've added new metrics and views to help you understand your Stackdriver usage now, as you prepare for the new pricing to take effect, and we have some tips to help you maximize value while minimizing costs across your monitoring, logging, and application performance management (APM) solutions.

Getting visibility into your monitoring and logging usage

In anticipation of the pricing changes, we’ve added new metrics to make it easier than ever to understand your logs and metrics volume. There are three different ways to view your usage, depending on which tool you prefer: the billing console; updated summary pages in the Stackdriver console; or metrics available via the API and Metrics Explorer.

1. Analyzing Stackdriver costs using the billing console
Stackdriver now reports logging and monitoring usage against new SKUs (a SKU is simply the name for something you can buy; in this case, a volume of metrics or logs), which are visible in the billing console. Don't worry: until June 30, these costs will still be $0, but you can already view your existing volume across your billing account by going to the new reports page in the billing console. To see your current Stackdriver logging and monitoring usage, select group by SKU, then filter for Log Volume, Metric Volume, or Monitoring API Requests. (See more in our documentation.) You can also analyze your usage by exporting your billing data to BigQuery, as sketched below. Once you understand your usage, you can estimate what your cost will be after June 30 using the pricing calculator under the Upcoming Model tab.
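For example, if you've already set up billing export to BigQuery, a query along the following lines can break out Stackdriver usage by SKU. This is a sketch: the project, dataset, and table names are placeholders for your own billing export table, and the SKU descriptions are matched loosely in case the exact wording differs:

$ bq query --use_legacy_sql=false '
    SELECT sku.description AS sku, ROUND(SUM(cost), 2) AS total_cost
    FROM `my-project.billing_export.gcp_billing_export_v1_XXXXXX`
    WHERE sku.description LIKE "%Log Volume%"
       OR sku.description LIKE "%Metric Volume%"
       OR sku.description LIKE "%Monitoring API Requests%"
    GROUP BY sku
    ORDER BY total_cost DESC'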

2. Analyzing Stackdriver costs using the Stackdriver console
We’ve also updated the tools for viewing and managing volumes of logs and metrics within Stackdriver itself.


The Logs Ingestion page, above, now shows last month's volume in addition to the current month's volume, both for the project as a whole and by resource type. We've also added handy links from this page to view detailed usage in Metrics Explorer.

The Monitoring Resource Usage page, above, now shows your metrics volume month-to-date versus the last calendar month (note that these metrics are brand new, so they will take some time to populate). All projects in your Stackdriver account are broken out individually. We've also added a projected total for the month, along with links to the details in Metrics Explorer.

3. Analyzing Stackdriver costs using the API and Metrics Explorer
If you’d like to understand which logs or metrics are costing the most, you’re in luck—we now have even better tools for viewing, analyzing and alerting on metrics. For Stackdriver Logging, we’ve added two new metrics:
  • logging.googleapis.com/billing/bytes_ingested provides real-time incremental delta values that can be used to calculate your rates of log volume ingestion. It does not cover excluded logs volume. This metric provides a resource_type label to analyze log volume by various monitored resource types that are sending logs.
  • logging.googleapis.com/billing/monthly_bytes_ingested provides your usage as a month-to-date sum every 30 minutes and resets to zero every month. This can be useful for alerting on month-to-date log volume so that you can create or update exclusions as needed.
We’ve also added a new metric for Stackdriver Monitoring to make it easier to understand your costs:
  • monitoring.googleapis.com/billing/bytes_ingested provides real-time incremental deltas that can be used to calculate your rate of metrics volume ingestion. You can drill down and group or filter by metric_domain to separate out usage for your agent, AWS, custom or logs-based metrics. You can also drill down by individual metric_type or resource_type.
You can access these metrics via the Monitoring API, create charts for them in Stackdriver, or explore them in real time in Metrics Explorer (shown below), where you can easily group by the labels provided on each metric, or use Outlier mode to find the metric or resource types with the highest usage. You can read more about aggregations in our documentation.
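As a sketch of the API route, a call like the one below lists the raw time series for the new logging metric; the project ID and time interval are placeholders:

$ curl -s -G \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://monitoring.googleapis.com/v3/projects/my-project/timeSeries" \
    --data-urlencode 'filter=metric.type="logging.googleapis.com/billing/bytes_ingested"' \
    --data-urlencode 'interval.startTime=2018-06-01T00:00:00Z' \
    --data-urlencode 'interval.endTime=2018-06-02T00:00:00Z'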

If you’re interested in an even deeper analysis of your logs usage, check out this post by one of Google’s Technical Solutions Consultants that will show you how to analyze your log volume using logs-based metrics in Datalab.


Controlling your monitoring and logging costs
Our new pricing model is designed to make the same powerful log and metric analysis we use within Google accessible to everyone who wants to run reliable systems. That means you can focus on building great software, not on building logging and monitoring systems. This new model brings you a few notable benefits:
  • Generous allocations for monitoring, logging and trace, so many small or medium customers can use Stackdriver on their services at no cost.
    • Monitoring: All Google Cloud Platform (GCP) metrics and the first 150 MB of non-GCP metrics per month are available at no cost.
    • Logging: 50 GB free per month, plus all admin activity audit logs, are available at no cost.
  • Pay only for the data you want. Our pricing model is designed to put you in control.
    • Monitoring: When using Stackdriver, you pay for the volume of data you send, so a metric written once an hour costs 1/60th as much as the same metric written once a minute. Keep that in mind when setting up your collection intervals. We recommend collecting key logs and metrics via agents or custom metrics for everything in production; development environments may not need the same level of visibility. For custom metrics, you can reduce cost by writing points less frequently (at a coarser time granularity), or by reducing the number of time series you send, for example by avoiding unnecessary high-cardinality labels on custom and logs-based metrics.
    • Logging: The exclusion filter in Logging is an incredibly useful tool for managing your costs, and the way we've designed our system to manage logs is truly unique. As the image below shows, you can choose to export your logs to BigQuery, Cloud Storage, or Cloud Pub/Sub without paying to ingest them into Stackdriver (a sketch of that pattern follows this list).
      You can even use exclusion filters to collect a percentage of logs, such as 1% of successful HTTP responses. Plus, exclusion filters are easy to update, so if you’re troubleshooting your system, you can always temporarily increase the logs you’re ingesting.
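Here is a minimal sketch of that export-without-ingestion pattern with gcloud; the sink name, exclusion name, project, dataset, and filter are placeholders, and it assumes your Cloud SDK includes the logging sinks and exclusions commands:

# Export all Compute Engine instance logs to a BigQuery dataset...
$ gcloud logging sinks create compute-to-bq \
    bigquery.googleapis.com/projects/my-project/datasets/compute_logs \
    --log-filter='resource.type="gce_instance"'

# ...and exclude the same logs from ingestion into Stackdriver Logging.
$ gcloud logging exclusions create compute-ingestion-off \
    --log-filter='resource.type="gce_instance"'

Note that after creating the sink you still need to grant its writer identity access to the destination dataset before exported entries will appear there.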

Putting it all together: managing to your budget
Let's look at how to combine the visibility from the new metrics with the other tools in Stackdriver to stay within a specific monthly budget. Suppose we have $50 per month to spend on logs, and we'd like to make it go as far as possible. That budget lets us ingest 150 GB of logs for the month: the 50 GB monthly allotment at no cost, plus 100 GB of paid ingestion. Looking at the Log Ingestion page, shown below, we can see that last month's volume was 200 GB, and that 75 GB of it came from our Cloud Load Balancer, so we'll add an exclusion filter for 99% of its HTTP 200 responses (sketched below).
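That exclusion might look like the following; this is a sketch, the exclusion name is arbitrary, and sample() keeps the remaining 1% of matching entries by excluding a 99% sample:

$ gcloud logging exclusions create lb-200-sampling \
    --description="Drop 99% of HTTP 200 responses from the load balancer" \
    --log-filter='resource.type="http_load_balancer" AND httpRequest.status=200 AND sample(insertId, 0.99)'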

To make sure we don't go over our budget, we'll also set a Stackdriver alert, shown below, to fire when the monthly log bytes ingested reach 145 GB. Based on the cost of ingesting log bytes, that's just before we hit the $50 monthly budget.
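One way to express that alert from the command line is with the alpha gcloud monitoring commands. This is only a sketch under stated assumptions: the policy file shape follows the Monitoring API's alerting policy format, the threshold is 145 GiB expressed in bytes, and no notification channels are attached here:

$ cat > log-budget-policy.json <<'EOF'
{
  "displayName": "Monthly log ingestion approaching budget",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "monthly_bytes_ingested above 145 GB",
      "conditionThreshold": {
        "filter": "metric.type=\"logging.googleapis.com/billing/monthly_bytes_ingested\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 155692564480,
        "duration": "0s"
      }
    }
  ]
}
EOF
$ gcloud alpha monitoring policies create --policy-from-file=log-budget-policy.json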

Based on this alerting policy, suppose we get an email near the end of the month that our volume is at 145 GB for the month to date. We can turn off ingestion of all logs in the project with an exclusion filter like this:
logName:*
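Assuming the gcloud logging exclusions commands are available in your SDK, one way to apply that project-wide filter, and remove it again later, is:

$ gcloud logging exclusions create stop-all-logs --log-filter='logName:*'

# When the new month (and budget) starts, remove the exclusion again.
$ gcloud logging exclusions delete stop-all-logs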

Now only admin activity audit logs will come through, since they don't count toward any quota and can't be excluded. Suppose we also need to save all data access logs for our project. Our sinks to BigQuery for those logs continue to work while the exclusion is in place, even though the logs won't appear in Stackdriver Logging until we disable the exclusion filter, so we don't lose any data during that period.


As with a household budget, running out of funds at the end of the month isn't a position you want to be in. Turning off your logs should be a last resort, much like shutting off the water in your house toward the end of the month: both make it harder to put out fires, or handle incidents, when they come up. One concrete risk is that if you have an issue and need to contact GCP support, they won't be able to see your logs and may not be able to help you.


With these tools, you can plan ahead and avoid ingesting less useful logs throughout the month. You might turn off unnecessary logs based on usage, rebalance monitoring and logging between production and development environments, or offload data to another service or database. Our new metrics, views, and dashboards give you many more ways to see how much you're spending in Stackdriver, in both resources and IT budget, so you can bring flexibility and efficiency to logging and monitoring and avoid unpleasant surprises.


To learn more about Stackdriver, check out our documentation or join in the conversation in our discussion group.

