Tag Archives: Google Cloud Platform

How we brought the latest version of Python to App Engine and Cloud Functions

At Cloud Next 2018, we added Python 3.7 support to Cloud Functions and now we’ve announced Python 3.7 support for the App Engine standard environment. These new runtimes allow you to write Python functions and apps using the latest version of Python and the rich ecosystem of packages available on Python Packaging Index (PyPI).

This new runtime marks a significant update to App Engine and was enabled by new open source software that we recently released: gVisor and FTL.

Python, straight from the source

Running Python 3.7 on App Engine and Cloud Functions required us to fundamentally rethink our infrastructure. Traditionally, meeting Google Cloud’s security requirements meant that we had to run a modified version of the Python interpreter. However, using a modified interpreter constrained some language features and only allowed us to support a limited set of whitelisted Python libraries.

Thanks to gVisor, a container sandbox that provides improved security and process isolation, we can now run the unmodified Python 3.7.0 interpreter. We’ve done extensive testing to make sure Python 3.7 is compatible with gVisor. As part of our compatibility testing, we run Python’s full suite of language tests, and tests for Python packages that are popular on PyPI. We’re committed to ensuring that everything you’ve come to know and love about Python is supported on our platform.

Seamless deployments

Most importantly, this change in our infrastructure makes it easier to take advantage of Python’s vast ecosystem. As a developer, you just add project dependencies to a requirements.txt file and deploy.

During deployment, FTL, a tool for building containers, fetches dependencies listed in your requirements.txt file and installs them alongside your app or function. FTL also includes a short-lived dependency cache, which speeds up repeated deployments if no changes are detected in your requirements.txt file. This is particularly useful if you find just need to re-deploy because you found a typo.

Keeping up with the Pythonistas

In making these changes, we also decided to expand the list of system packages that are included with each runtime’s Ubuntu 18.04 distribution. We think that will make life just a little bit easier for developers working with the latest release of Python.

Looking forward, we’re excited about how these changes will allow us to keep up with the Python community’s progress as they release new versions and libraries. Please let us know what you think and if you run into any challenges.

You can learn more about how to get started with it on App Engine and Cloud Functions in our documentation. We can’t wait to see what you build with Python 3.7.

By Stewart Reichling, Product Manager

We’ve moved! Come see our new home!


Ten years, three months and 30 days ago, we wrote our first post on this blog, and now, we’re writing our last at this particular web address. Today, it’s with great excitement that we present to you the Google Cloud blog, your home for all the latest GCP product news, how-to’s, perspectives and customer stories that you’re used to, all living happily on a shiny, new mobile-friendly platform.

We’re really excited about this change. Not only does the new blog look really nice, but it includes all the content from across the entire Google Cloud family—GCP, G Suite, Google Maps Platform and Chrome Enterprise—so you can see how they all fit together. And because data analysis and artificial intelligence are so central to everything people are building today, we’ve also folded our Big Data and Machine Learning blog into this new platform.

Besides collecting all Google Cloud blog content in one place, we think you’ll really benefit from the blog’s rich tagging capabilities. Now, you can view blog posts by platform, and also drill down to specific technology areas like Application Development, Networking or Open Source, so you can quickly find related content. There are also dedicated pages for partners, customers, trainings and certifications, and solutions and how-to’s, to name a few. And because we can also tag posts to multiple products and topics, you’ll be sure to find what you’re looking for.

Those are just the high-level changes. There are a whole lot of new features to use and explore, and we encourage you to browse the site and get familiar with it. What’s not new is our mission: to provide you with honest, technical content to show you how to build your business on GCP.

To date, we’ve migrated over two year’s worth of GCP blog posts to this new home, with more to come. Let us know if you find any broken links, typos, or just flat-out missing content. And of course, we’d love your feedback on our content, the design, or any features you’d like to see. Thanks for reading!

Last month today: July on GCP

The month of July saw our Google Cloud Next ‘18 conference come and go, and there was plenty of exciting news, updates and demos to share from the show. Here’s a look at some of the most-read blog posts from July.

What caught your attention this month: Creating the open cloud
  • One of the most-read posts this month covered the launch of our Cloud Services Platform, which allows you to build a true hybrid cloud infrastructure. Some of the key components of Cloud Services Platform include the managed Istio service mesh, Google Kubernetes Engine (GKE) On-Prem and GKE Policy Management, Cloud Build for fully managed CI/CD, and several serverless offerings (more on that below). Combined, these technologies can help you gain consistency, security, speed and flexibility of the cloud in your local data center, along with the freedom of workload portability to the environment of your choice.
  • Another popular read was a rundown of Google Cloud’s new serverless offerings. These include core serverless compute announcements such as new App Engine runtimes, Cloud Functions general availability and more. It also included serverless containers, so you can run serverless workloads in a fully managed container environment; GKE Serverless add-on to easily run serverless workloads on Kubernetes Engine; and Knative, the open-source project on which that add-on is built. There are even more features included in this post, too, like Cloud Build, Stackdriver monitoring and Cloud Firestore integration with GCP. 
Bringing detailed metrics and Kubernetes apps to the forefront
  • Another must-read post this month for many of you was Transparent SLIs: See Google Cloud the way your application experiences it, announcing the availability of detailed data insights on GCP services that your workloads use—helping you see like a Google site reliability engineer (SRE). These new service-level indicators (SLIs) go way beyond basic uptime and downtime to delve into response codes, latency and more. You can then separate out metrics by GCP service to see things like API version, location and protocol. The result is that you can filter and sort to get extremely fine-grained information on your software and the GCP services you use, which helps cut resolution times and improve the support experience. Transparent SLIs are available now through the Stackdriver monitoring console. Learn more here about the basics of using SLIs and other SRE tools to measure and manage availability.
  • It’s also now faster and easier to find production-ready commercial Kubernetes apps in the GCP Marketplace. These apps are prepackaged and configured to get up and running easily, whether on Kubernetes Engine or other Kubernetes clusters, and run the gamut from security, data analytics and developer tools to storage, machine learning and monitoring.
There was obviously a lot to talk about at the show, and you can get even more detail on what happened at Next ‘18 here.

Building the cloud back-end
  • For all of you developing cloud apps with Java, the availability of Jib was an exciting announcement last month. This open-source container image builder, available as Gradle and Maven plugins, cuts out several steps from the Docker build flow. Jib does all the work required to package your app into a container image—you don’t need to write a Dockerfile or even have Docker installed. You end up with faster builds and reproducible container images.
  • And on that topic, this best practices for building containers post was a hit, too, giving you tips that will set you up to run your environment more smoothly. The tips in this blog post cover graceful application shutdowns, how to simplify containers and how to choose and tag the container images you’ll use. 
It’s been a busy month at GCP, and we’re glad to share lots of new tools with you. Till next time, build away!

Repairing network hardware at scale with SRE principles



To support our Google Cloud Platform (GCP) customers, we run a complex global network that depends on multiple providers and a lot of hardware. Google network engineering uses a diverse set of vendor equipment to route user traffic from an internet service provider to one of our serving front ends inside a GCP data center. This equipment is proprietary and made by external networking vendors such as Arista, Cisco and Juniper. Each vendor has distinct operational methods, configurations and operational consoles.

With hundreds of distinct components utilized across our global network, we routinely deal with hardware failures—for example, a failed power supply, line card or control plane card. The complexity of today’s cloud networks means that there are a huge number of places where failure can occur. When we first began building and operating our own data centers, Google had a team of engineers, network engineers and site reliability engineers (SREs) who performed fault detection, mitigation and repair work on these devices, using manual processes guided by a ticket system. Google’s SRE principles are prescriptive, and aim to guide developers and operations teams toward better systems reliability. As with DevOps, avoiding toil—the manual tasks that can eat up too much time—is an essential goal.

We realized after becoming familiar with common hardware problems that any ticket type that we encountered repeatedly and that follows a predetermined sequence of steps can easily be automated. Our team created a list of playbooks over time that detailed steps of how to deal with each hardware failure scenario, taking into account relevant software and hardware bugs and typical steps to resolution. Each playbook is used when an alert is received. Given that we already knew in advance how to deal with each issue as it arose, it made sense to automate the work. Here’s how we did it.

Building the automation interface

“In the old way of doing things, we treat our servers like pets, for example, Bob the mail server. If Bob goes down, it’s all hands on deck. The CEO can’t get his email and it’s the end of the world. In the new way, servers are numbered, like cattle in a herd. For example, www001 to www100. When one server goes down, it’s taken out back, shot, and replaced on the line.”
- Randy Bias

The above quote describes a classic engineering scenario often applied within SRE: "Pets vs. cattle," which describes a way of looking at data center hardware as either individual components or a herd of them. The two categories of equipment can be described as follows:

Pet:
  • An individual device you work on. You're familiar with all of its particular failure modes. 
  • When it gets sick, you come to the rescue.

Cattle:
  • A fleet of devices with a common interface.
  • You manage the "herd" of devices as a group.
  • The common interface lets you perform the same basic operations on any device, regardless of its manufacturer.
Before we moved to automating network hardware failure resolution, we were stuck handling our networking equipment like pets, with an eye toward what made it unique, rather than as cattle, with an eye toward what made it a commodity. We needed to make it easier not to custom-manage all these networking devices. Our initial automation design aimed to turn our fleet into cattle by providing a common interface for interacting with networking equipment. Specifically, we used the underlying primitives to implement a higher-level interface for performing common operations—in this case, the basic operations of a line card in a network device, regardless of vendor: "Bring it online," "Take it offline" and "Check the status." We defined the following interface for a line card, using the Go programming language.


type Linecard interface {
  Online() error
  Offline() error 
  Status() error
}
The error qualifier in Go simply means that the function returns an error object if it fails. The underlying code implementing this interface for a Juniper line card varies significantly from implementation on the Cisco line card, but the caller of the function is insulated from the implementation. The upper level code imports the library, and when it operates on a line card, it can only perform one of those three actions we specified above.

We then realized that we could apply the same interface to many hardware components—for example, a fan. For certain vendors, the Online() and Offline() functions did nothing, because those vendors didn't support turning a fan off, so we just used the interface to check the status.
type Fan interface {
  Online() error
  Offline() error 
  Status() error
}
Building upon this line of thought, we realized that we could generalize this interface to define a common interface for all hardware components within a device.
type Component interface {
  Online() error
  Offline() error 
  Status() error
}
By structuring the code this way, anyone can add a device from a new vendor. Moreover, anyone can add any type of new component as a library. Once the library implements this common interface, it can be registered as a handler for that specific vendor and component.

Deciding what to automate

The system needed to interact with humans at various stages of the automation. To decide what to automate, we drew a flow chart of the normal human-based repair sequence and drew boxes around stages we believed we could replace with automation. We used the task of replacing a vendor control plane board as an example. Many of the steps have self-explanatory names, but these are definitions of some of the more complex ones:
  • Determine control plane: Find faulty control plane unit.
  • Determine state: Is it the master or the backup? 
  • Copy image to control plane: Copy the appropriate software image to the master control plane.
  • Offline control plane: Send the backup control plane offline.
  • Toggle mastership: Make the replaced control plane the new master.
Figure 1: Manual workflow for replacing a vendor control plane board
When we needed to carry out this workflow, a Google network engineer performed each step in Figure 1, with the exception of pulling out and replacing the failed control plane, which was performed by someone on-site at a data center location.

Once we had defined this task, we created an automated workflow. The goal of the new system was to provide a UI for our hardware engineers in a data center that allowed them to perform one of those operations at a specific time under specific conditions and with various automated safety checks, followed by an entire device audit at the end of the operation. Previously, a human had performed all of these steps, but now a human only needed to perform the step “hardware gets replaced” in Figure 2—the hardware replacement.
Figure 2: Automated workflow for replacing a vendor control plane board
Automation, before and after
Figure 3: High-level system view.
You can see in Figure 3 what the system looked like after automation. Before automating this workflow, there would have been a lot of manual work. When an alert initially came in, an engineer would have stopped traffic to the device, and offlined by hand the bad component. Our network operations center (NOC) team would then work with the vendor—for example, Juniper or Cisco— to get a replacement part on-site. Next, we would file a change request in our change management system, noting the date of the operation.

On the day of the operation:
  • The data center technician would click “start” on the change management system to begin the repair.
  • Our system picks up this change and is ready to begin the repair.
  • The technician clicks “start” on our UI.
  • An “offline” state machine starts proceeding through the various steps to take the component offline safely.
  • The UI notifies the user each step of the way.
  • Once the state machine has completed, it notifies the technician, who can safely replace the component.
  • Once the component is replaced and re-cabled, the technician returns to the UI and begins the “online” state machine, which safely returns the component into production.
When we reviewed our original automation design, we noticed there would be a lot of work involved in building the various systems needed to implement the automated workflow. To facilitate collaboration, we created ticket items for each component of the system, so multiple engineers could work on the project in parallel.

Automation lessons learned

We used an iterative approach in our planning and execution. We first focused on replacing the line card for one vendor, then moved on to multiple vendors and multiple components. Due to the modular design of the code base and the interacting systems, adding more modules and scaling the code horizontally was easy. 

For example, adding a new library that handled fan replacements meant simply creating the code to handle this and ensuring it implemented the above interface. Then it registered itself in the main function.

We had the option to extend or repurpose existing automation systems owned by our software management teams to meet our needs. We had to carefully consider whether to use those systems or build our own, potentially duplicating work if we chose the latter. Ultimately, we built our own automation because the other systems were understaffed. Trying to extend their tools would have disrupted other teams' project work and delayed our own project.

What worked well
Leveraging multiple engineers to automate our internal part of the workflow allowed us to take the project from design to implementation within a short period—about one year.

What didn’t
We haven't yet fully automated our hardware replacement workflow. Doing so involves troubleshooting hardware issues with vendors and persuading them that each individual failure merits a device or component replacement. We work around this gap in our automation by keeping spares on site for use with our repair automation, and handling the vendor workflow portion of the process separately and mostly manually through our NOC. We are currently working toward a fully automated vendor interaction with our vendor partners.

Measuring automation success
We can measure the hours our automation saves engineers using Google's production change logging service, which all internal tools use to record changes made to the production environment. The service logs changes made by tools manually invoked by engineers as well as tools that provide end-to-end automation without manual input. Thus we can compare how long each network repair action used to take when performed manually vs. the number of repair actions that are undertaken by today's fully automated system. These two data sets allow us to calculate the total time savings from automation. As shown in Figure 4, network hardware repair automation saves us hundreds of hours every month.

Tips for reducing toil through automation

While strategies for eliminating toil must be tailored to your individual environment and use cases, some approaches are universal. Based upon our own experience eliminating toil by automating network repair tasks, we recommend the following: 
  • Measure your toil.  
  • Tackle the biggest sources of toil first, and don't try to solve all problems at once.  
  • Carefully consider whether to enhance existing tools or build new ones. Even if you can partially repurpose another team's work, would creating a tool from scratch actually make more sense cost- or resource-wise? 
  • Take a design-driven approach. Iterate on the design, starting small and iterating quickly. Don't try to design the perfect approach from the start.  
  • Measure your time savings to determine your return on investment.
Automation has proved useful for our team of network site reliability engineers at GCP. Learn more about the practice of SRE and how you might apply its principles to your own network projects.

Istio reaches 1.0: ready for prod



Today, Google Cloud is proud to announce, together with our collaborators, that the Istio open-source project has reached the 1.0 milestone. This is a key step toward delivering the Cloud Services Platform that we discussed last week, helping you manage your services in a hybrid world where some of your infrastructure runs on VMs and some in Kubernetes, some services run in the cloud and some on-premises.

Istio: a service mesh

Istio is at its heart a service mesh—software that layers transparently onto an existing distributed application. It collects logs, traces and telemetry, and adds security and policy without embedding client libraries. Moreover, Istio is also a platform, complete with APIs that let you integrate with systems for logging, telemetry and policy.

Istio delivers a service-based view of the service interactions across the mesh. Whereas traditional monitoring gives you low-level metrics such as nodes’ CPU consumption, Istio measures the actual traffic between services: requests per second, error rates and latency. It also generates a dependency graph so you can see how services affect one another.

With Istio, your DevOps team gets the tools it needs to run distributed apps smoothly. Istio does canary rollouts, letting you smoke-test a new build to make sure it’s performing well before ramping up. It also offers fault-injection, retry logic and circuit breaking so DevOps teams can do more testing and change network behavior at runtime to keep applications up and running.

And finally, Istio adds security. It can be used to layer mTLS on every call, adding encryption-in-flight and giving you the ability to authorize every single call on your cluster and in your mesh.

Istio in action

Istio provides foundational capabilities for your infrastructure, freeing developers to work on code that is critical to your business. But there’s only one way to prove that Istio is ready for the enterprise: by running real workloads on it in production. Already, there are at least a dozen companies running Istio in production, including several on GCP. We worked with them through early hurdles, incorporated their feedback, and they’re reaping the benefits of Istio already. A great example is Auto Trader UK, which used Istio to help accelerate their move to containers and the public cloud.

Auto Trader UK is not only migrating from private cloud to public cloud, but also moving from virtual machines to Kubernetes. The level of control and visibility that Istio provides has enabled us to significantly de-risk this ambitious work, and in several cases has actually helped surface issues we were previously unaware of. We've been able to accelerate the delivery of capabilities such as mutual TLS, that previously would have taken significant engineering effort, allowing us to focus on our market differentiators.
- Karl Stoney, Delivery Infrastructure Lead, Auto Trader UK

A true joint effort

We first released Istio as open source last year, and what a year it’s been. Since that first 0.1 release, Istio has improved and matured significantly, with eight versions, 200+ contributors, and 4,000+ check-ins adding an ever growing set of functionality.

Getting to version 1.0 was truly a community-driven effort. IBM was a key collaborator and co-founder, and Lyft’s Envoy proxy is a key component of the project. Since then, the number of companies involved in Istio has skyrocketed, including Cisco, Red Hat, and VMware consolidating industry support with the goal of accelerating adoption and meeting the service mesh needs of their customers.

“The growth of Istio since its launch last year has been tremendous, and it’s quickly taking its place as the standard way to manage microservices in the cloud,” said Jason McGee, IBM Fellow and VP, IBM Cloud. “Our mission since Istio’s launch has been to enable everyone to succeed with microservices, especially in the enterprise. This is why we’ve focused the community around improving security and scale, and heavily leaned our contributions on what we’ve learned from building agile cloud architectures for companies of all sizes.”
- Jason McGee, IBM Fellow and VP, IBM Cloud 
"We see Istio's potential to be able to solve some of the most complex aspects of application development and deployment. It brings a control plane for service mesh, cluster orchestration, and network control that will support and enable developers to focus on the more important aspects of their application development. We are looking forward to leveraging Istio in Red Hat OpenShift to enable developers to deploy their applications in a more secure and efficient manner." 
- Brian 'Redbeard' Harrington, product manager, Istio, Red Hat
“VMware has been an integral part of the community developing Istio service mesh. We see great potential in Istio’s service-based approach to connectivity, security, and observability. We believe it will become an infrastructure cornerstone, spanning across vSphere and Kubernetes platforms and multiple private and public clouds, and helping our enterprise customers improve development efficiencies and deliver on their SLAs / SLOs in a secure manner. Istio’s application layer complements the network virtualization layer, and together allow enterprises to achieve defense in depth, improve performance and scalability, and speed time to application value.” 
- Pere Monclus, CTO Network and Security, VMware

We’re also thrilled with the number of companies writing adapters for Istio—from observability software from SolarWinds and Datadog, to deployment tools from Weaveworks and CodeFresh, to policy and security offerings from Aspenmesh and Octarine. While Istio is transparent to application developers, it provides a standard integration interface for anyone writing observability tools or policy engines.

Working and integrating with other open source projects in the community drives our success, as well. Integrations with SPIFFE, the Open Policy Agent and OpenTracing all improve the state of open source and the lives of developers.

Istio on GCP

While the open-source Istio project is a major undertaking, we’re also intent on making it especially easy to use on Google Cloud Platform. Last week at Google Cloud Next we announced the alpha release of Managed Istio: open-source Istio that’s automatically installed and upgraded on your Kubernetes Engine clusters as a part of the Cloud Services Platform. Managed Istio will help provide the visibility, security and control you need over services running in hybrid environments, and it integrates with other Google products like Stackdriver and Apigee.

Achieving 1.0 is just a first step, both for the project and for us at Google Cloud. We have ambitious plans for adding features and improving Istio’s usability with  the ultimate goal of delivering a complete set of tools to manage all of your services, so that you can focus on writing software and running a business.

To find out more about Istio and how to get started using it on GCP, please visit cloud.google.com/istio.

Access Google Cloud services, right from IntelliJ IDEA



Great news for IntelliJ users: You can now use Google Cloud services and APIs right from JetBrains’ integrated development environment (IDE). With the Cloud Tools for IntelliJ plugin, you can now discover APIs, consume them, and test against them locally, all without leaving your IDE.

The Cloud Tools plugin for IntelliJ streamlines the development process by integrating tasks into the IDE, such as enabling Google Cloud APIs, creating service accounts for local development, and adding the corresponding Java client libraries to your build.
Example: Using the Cloud Translation API with the Cloud Tools for IntelliJ plugin

Say you are interested in using the Cloud Translation API in our Java Maven-based project. If the Cloud Tools for IntelliJ plugin isn’t already configured, then first install it as described in this quickstart.

Clone the example Cloud Translation project, which allows you to translate some input text from English to French.

git clone https://github.com/GoogleCloudPlatform/java-docs-samples.git
Open the project, located under “java-docs-samples/translate”:

At this point, you might simply try to run the application by navigating to the main method and clicking the play button:

… and configuring the input arguments to translate some text from English to French by editing the newly created run configuration:

Run the program again, and this time you get the following error:

As you may have already guessed, you’re missing authentication rights to access the Cloud Translation API from your local machine. To overcome this, you’d normally have to go through the following steps:
  1. Enable the service on your Google Cloud Platform (GCP) project
  2. Create a new service account with the appropriate roles for accessing the service
  3. Update your local run configuration with the necessary environment variables to access the service
Thankfully, the Cloud Tools for IntelliJ plugin can help. In IntelliJ, navigate to the Cloud Tools menu item under “Tools > Google Cloud Tools > Add Cloud libraries …”:

Select the Cloud Translation API and your GCP project, and click “Add Cloud Libraries”:

In the confirmation window that appears, you can see that Cloud Tools for IntelliJ takes care of enabling the API and creating the service account for you:

Lastly, select the run configuration that you created earlier so that the plugin can inject the necessary environment variables for accessing the Cloud Translation service from your local machine:

Run the program again and your input text is successfully translated from English to French using the Cloud Translation service:

The Cloud Tools for IntelliJ plugin also assists with the following:
  • Adding Java client libraries to your Maven pom.xml if they are not already present
  • Writing a Bill of Materials (BOM) to your pom.xml to help avoid dependency version conflicts
  • Detecting and acting on potential misconfigurations, including a missing BOM, through pom.xml file inspections with quick-fixes
The Cloud Tools for IntelliJ plugin provides many more features to help optimize your development workflow including support for Google App Engine, Stackdriver Debugger, Cloud Repositories, and Cloud Storage. For more information and to leave feedback please visit the official documentation and GitHub pages:

Cloud Tools for IntelliJ:

Drilling down into Stackdriver Service Monitoring



If you’re responsible for application performance and availability, you know how hard it can be to see it through the eyes of your customers and end users. We think that’s really going to change with last week’s introduction of Stackdriver Service Monitoring, a new tool for monitoring how your customers perceive your applications, and that then lets you drill down to the underlying infrastructure when there’s a problem.

Most IT operations tools take a bottoms-up understanding of IT systems: they look at compute, storage, and networking metrics to infer the customer experience. Application performance management (APM) tools like tracing systems, debuggers, and profilers consider the application from the code level—but lose sight of the underlying infrastructure. Sometimes, a logs analytics solution can provide the glue between those two layers, but often with great effort and expense.

IT operators have been missing a cost-effective, easy-to-use, general-purpose tool to monitor the customer-facing behavior of their applications. It’s hard to know how end users experience your software and it’s difficult to measure services and applications in a standardized way. Ops staff risk burning out from all the spurious alerts. The result of all this is that mean-time-to-resolution (MTTR) is longer than necessary, and customer satisfaction is lower than desired. The situation is exacerbated with microservice architectures where the app itself is broken into many small pieces, which makes it hard to understand how all the pieces fit together and where to start investigating when there is a problem.

That all changes with the release of Stackdriver Service Monitoring. Service Monitoring takes advantage of service-aware, “opinionated” infrastructure so you can monitor how end users perceive your systems, letting you drill down to the infrastructure level when necessary. Initially, we are supporting this functionality for Google App Engine and for Istio service meshes running on Google Kubernetes Engine. We will expand to more platforms over time.

With Stackdriver Service Monitoring, you get the answers to the following questions:
  • What are your services? What functionality do those services expose to internal and external customers?
  • What are your promises and commitments regarding the availability and performance of those services, and are your services meeting them?
  • For microservices-based apps, what are the inter-service dependencies? How can you use that knowledge to double check new code rollouts and triage problems in the event of service degradation?
  • Can you look at all the monitoring signals for a service holistically to reduce MTTR?

Anatomy of Stackdriver Service Monitoring

Service Monitoring has three pieces: the service graph, Service Level Objectives (SLOs), and multi-signal service dashboards. Together, these give you an inventory of your services, visually display the dependencies between them, let you set and measure availability and performance promises, help you triage application problems to quickly find the root cause, and finally, help you debug broken services more quickly than ever before. Let’s look at each piece in turn.

The service graph: This is a service-specific view of your infrastructure. It starts out with a real-time top level display of all services in the Istio service mesh and the communication links between them. Selecting one service displays charts with error rates and latency metrics. Double-clicking on a service allows you to drill down into its underlying Kubernetes infrastructure, providing the long elusive connection between app behavior and infrastructure. There is also a time slider which allows you to see the graph at previous points in time. Using the service graph you can see your application architecture for reference purposes or to triage problems. You can explore metrics about service behavior, and determine whether an upstream service is causing problems to a downstream service. Finally, you can compare the service graph at different points in time to determine whether there was a significant architectural change right before a problem was reported. There is no quicker way to get started exploring and understanding complex multi-service applications.

SLOs: Internally at Google, our Site Reliability Engineering team (SRE) only alert themselves on customer-facing symptoms of problems, and not all potential causes. This better aligns them to customer interests, lowers their toil, frees them to do value-added reliability engineering, and increases job satisfaction. Stackdriver Service Monitoring lets you to set, monitor, and alert on SLOs. Because Istio and App Engine are instrumented in an opinionated way, we know exactly what the transaction counts, error counts, and latency distributions are between services. All you need to do is set your targets for availability and performance and we automatically generate the graphs for service level indicators (SLIs), compliance to your targets over time, and your remaining error budget. You can configure the maximum allowed drop rate for your error budget; if that rate is exceeded, we notify you and create an incident so that you can take action. To learn more about SLO concepts including error budget, we encourage you to read the SLO chapter of the SRE book.

Service Dashboard: At some point, you will need to dig deeper into a service’s signals. Maybe you received an SLO alert and there’s no obvious upstream cause. Maybe the service is implicated by the service graph as a possible cause for another service’s SLO alert. Maybe you have a customer complaint outside of an SLO alert that you need to investigate. Or, maybe you want to see how the rollout of a new version of code is going.

The service dashboard provides a single coherent display of all signals for a specific service, all of them scoped to the same timeframe with a single control, providing you the fastest possible way to get to the bottom of a problem with your service. Service monitoring lets you dig deep into the service’s behavior across all signals without having to bounce between different products, tools, or web pages for metrics, logs, and traces. The dashboard gives you a view of the SLOs in one tab, the service metrics (transaction rates, error rates, and latencies) in a second tab, and diagnostics (traces, error reports, and logs) in the third tab.

Once you’ve validated an error budget drop in the first tab and isolated anomalous traffic in the second tab, you can drill down further in the diagnostics tab. For performance issues, you can drill down into long tail traces, and from there easily get into Stackdriver Profiler if your app is instrumented for it. For availability issues you can drill down into logs and error reports, examine stack traces, and open the Stackdriver Debugger, if the app is instrumented for it.

Stackdriver Service Monitoring gives you a whole new way to view your application architecture, reason about its customer-facing behaviors, and get to the root of any problems that arise. It takes advantage of infrastructure software enhancements that Google has championed in the open source-world, and leverages the hard-won knowledge of our SRE teams. We think this will fundamentally transform the ops experience of cloud native and microservice development and operations teams. To learn more see the presentation and demo with Descartes Labs at GCP Next last week. We hope you will sign up to try it out and share your feedback.

Transparent SLIs: See Google Cloud the way your application experiences it



Like all good IT organizations, you religiously measure the performance and availability of your services and applications. But if those apps run in the cloud, critical components are often delivered by a third party or the cloud provider. In the case of a service disruption or degraded performance, how do you know what the problem is—your code, the network, or the provider? And, if the problem is with the service provider, how do you convince them to take action as quickly as possible?

Here at Google Cloud, we are the first cloud provider to report detailed standardized metrics on the behavior of our more than 130 Google Cloud service APIs, and how they are experienced by your applications. Today, we are happy to announce Transparent SLIs (service level indicators) - fine-grained detail about the behavior of Google Cloud Platform (GCP) services as related to your workloads. We display this data in Stackdriver Monitoring dashboards, and it's the same kind of data that Google SREs use to keep our services up and running. (Visit this post to learn more about SLIs.)

Transparent SLI metrics go far beyond simple up/down monitoring of our services. Now, you can debug subtle interactions between your application and our service from Stackdriver metrics such as how many transactions you sent, the rates of their various response codes, and their latency distribution. Then, for each service, you can slice and dice the metrics according to:
  • Service name
  • Method
  • API version
  • Credential ID
  • Location
  • Protocol (HTTP / gRPC)
  • HTTP Response Code (e.g. 402)
  • HTTP Response Code class (e.g. 4xx)
  • gRPC Status Code
Using Stackdriver’s Metrics Explorer, you can browse Transparent SLI metrics and group and filter them by any of the above-mentioned attributes, presenting their mean, min, max, sum, standard deviation, count, and 5th, 50th, 95th, & 99th percentiles. With this, you can easily perform the analysis to determine which subsets of your app’s traffic to GCP services are seeing issues. When you find a view that’s particularly useful, you can save that chart on a Stackdriver custom dashboard that you can view again and again like the following:
An example dashboard for GCP services that groups metrics by service, method and response code. You can also view latency charts on a log scale to quickly find outliers.

Data is power

Transparent SLIs give you the ability to transform your cloud operations for the better. By helping you drill down into interactions between your software and our services, GCP service metrics can tell you whether our services are behaving abnormally for your app’s traffic to speed the problem triage process. Furthermore, when you’re communicating with Google tech support, you can direct them to these charts so that everyone is working from the same data and can agree as to what’s being experienced. By shortening triage time and back and forth with tech support, we can dramatically reduce resolution times.

Here are some examples of how using GCP service metrics can improve the support experience:
  • If all of your calls to a service are failing for a single credential ID, but not any other, chances are there’s something wrong with that account that you can fix yourself without opening a ticket.
  • You’re troubleshooting a problem with your app, and notice a correlation between your application’s degraded performance and a sustained increase in the 50th percentile latency of a critical GCP service. Definitely call us and point us to this data so we can start working on the problem as quickly as possible.
  • The latencies for a GCP service report look good and unchanged from before, but your in-app client-side metrics report that the latency on calls to the service is abnormally high. That suggests that there might be some trouble in the network. Call your network provider (in some cases, Google) to get the debugging process started.
Over time, we think Transparent SLIs’ fine-grained visibility and transparency may change how you think about your services. For every super-demanding latency-sensitive cloud service (e.g., memcache), there are lots of others for which scale and reliability matter much more. Some APIs, Google Cloud Storage or BigQuery for example, can take a of couple seconds at the high end without customers noticing. With data from GCP service metrics, the more you know about the range of typical performance, the easier it is to recognize the outliers.

Transparent SLIs may also help you understand that latency results for most services fall within a normal distribution: a big hump in the middle, and outliers on either side. The metrics will help you understand the normal distribution so that you can engineer your app to work well within the distribution curve. For example, the metrics can help you correlate distribution changes with times when your app is not working as intended, helping you find the root cause of an issue. We expect the 99th percentile to look very different than the median—what we don’t expect are dramatic changes in those percentiles over time. Thus, when investigating whether a GCP service is at fault for an application problem, you should examine the return codes and latency rates over time and look for sustained changes from the norm that are correlated with observed issues in your application(We suggest that you consider the last week to be the norm.)



Setting up dashboards for Transparent SLIs

To get started collecting and exploring Transparent SLIs, go to Stackdriver Metrics Explorer and select "Consumed API" as the resource type. Stackdriver then introspects your project and creates a list of metrics that you can chart based on the products and services you are using. You can then pick the metrics that make the most sense for your environment. You can narrow down the data you display by specifying which project or service you want to monitor. It may also be helpful to specify which credentials’ traffic to view so that you only monitor traffic from production applications and not from other sources.

Stackdriver Metrics Explorer supports availability and latency metrics, which you can combine with filters and aggregations for new and insightful views into your application performance. For example, you can combine a request count metric with a filter on the HTTP Response Code class to build a dashboard that shows error rates over time. Or you can look at the 95th percentile latency of requests to the Cloud Pub/Sub API.

Since the main use case for Transparent SLIs is to help you triage issues with your application and see if GCP services may be the cause, the ideal way to use this data is to mix our metrics with yours. If you have an app that is highly dependent on Cloud SQL, for example, don’t graph the SLIs for Cloud SQL on their own—create a chart with your app’s error rate as one line and the Cloud SQL error rate as another line on the same chart. Doing this allows you to see at a glance whether Cloud SQL errors are a likely cause of unavailability in your app.

Keep us honest

We here at Google Cloud are committed to transparency, and sharing metrics about our services is an important part of that ethic. By sharing them with you, you can easily check up on how we are doing, so that when we work together on a service ticket, everyone is on the same page. We think Transparent SLIs will radically improve your tech support experience and increase your confidence in Google Cloud. Try it out and let us know what you think!

Google Cloud and GitHub collaborate to make CI fast and easy



Today, Google Cloud and GitHub are delivering a new integrated experience that connects GitHub with Google’s Cloud Build, our new CI/CD platform. Together, we will provide fast, frictionless, and convenient Continuous Integration (CI) for any repository on GitHub, integrated directly into the GitHub developer workflow.

Millions of developers trust GitHub today to store and collaborate around source code. Working with GitHub, we realized we had an opportunity to help make it significantly easier for any repository to add CI, integrate DevOps practices, and improve velocity and productivity. We set out to build that together, and today’s release is the first step in that collaboration.

Continuous Integration drives developer productivity

“Continuous integration is a crucial element of modern software development, but historically one that has required development teams to invest significant effort in patching together disparate software products and services to build a working, streamlined pipeline. This is an area where partners with adjacent offerings can add real value by pre-integrating the necessary pieces to deliver a seamless experience. This is what GitHub and Google have set out to do.”
- Rachel Stephens, Analyst, RedMonk
Software development is built on trust. We work in teams and trust our fellow developers to write the right code together. We use open-source operating systems, tools, and libraries so we can focus on the code that we need to write. We trust cloud platforms so we can develop, test, run, and manage our applications securely, at scale. Google Cloud builds on that trust by developing and using open technologies such as Kubernetes, TensorFlow, and Go.

DevOps is also built on trust. Trust is what lets us go faster. We know that mistakes and errors happen and that we will learn from them. We create a culture of trust through transparency and data-driven decisions, through a spirit of shared-fate and blameless post-mortems for continuous improvement. We use automation everywhere, especially CI, to create a safety net. Trust in our tests and our tools lets us go faster. Cloud Build provides the DevOps tools to unleash developer productivity, and help teams go faster.

Collaborations are built on trust too. Google and GitHub have a long history of working together to make software development better for all developers. We have a shared belief in the principles and practices of open source, and a shared vision of productive developers and software teams. We have worked together on improvements to the Git client and protocol, as well as other projects. And Google uses GitHub too: Googlers contributed to nearly 30,000 repos on GitHub last year, some of which are among the most popular projects on GitHub.

Cloud Build and GitHub, better together

“GitHub is excited to partner with Google to make CI for cloud-native application development painless. The ability to use Cloud Build for CI as a part of the GitHub workflow is just the start of this partnership and we look forward to building more in the future with Google.”
- Jason Warner, SVP of Technology at GitHub (read more in GitHub’s blog post)
The integration of Cloud Build with GitHub makes it quick to adopt CI and validate changes by integrating code early and often, bringing a host of benefits to developers, directly from their GitHub workflow.

Zero-config Docker builds: In one step, you can run automated container builds and tests on changes pushed to a GitHub repository as a part of every pull request. GitHub will automatically detect and recommend CI for repositories that contain a Dockerfile.

Scalability: Cloud Build meets the growing needs of your organization. You can go from a single build on your local machine to multiple builds in parallel in the cloud across numerous projects, all in a matter of minutes.

Security: The builds run on infrastructure protected by Google’s security. You get full control over who can create and view your builds, what source code can be used, and where your build artifacts are stored.

Flexibility: For advanced use cases, you can include a cloudbuild.yaml file when setting up CI using Cloud Build. This lets you define custom build steps, speed up builds by caching a Docker image, build leaner containers, and deploy directly to Google Kubernetes Engine, Google App Engine, on-prem clusters (in alpha soon), or another cloud provider.

Insights: Once the build is complete, details about build times, failures and artifacts are available within GitHub through the Checks API, so you can understand and diagnose build results from within the familiar GitHub environment. Full logs and history are available in Cloud Build’s UI in the Google Cloud Console.

Join us

Today’s integration is already available in the GitHub Marketplace. Smart CI recommendations will be rolled out to all GitHub users on a phased basis. Please try it out, and share your feedback with us.

Google and GitHub have had a long relationship serving developers, and this is just the next step. We know there are many other ways we can make software development better for developers. We trust you’ll join us on this journey.

Accelerating software teams with Cloud Build



Software development has come a long way from the days of “it compiles, ship it!” Today’s software teams need to deliver more business value faster than ever—in an environment where the pace of change is accelerating. And while change can mean faster hardware, better security, and more features, it can also come at a cost: new vulnerabilities are discovered every day and seemingly innocuous updates can cause applications to break.

DevOps has learned a lot from manufacturing. The best time to catch and fix a problem is as early and automatically as possible. In software, a similar culture of continuous improvement is essential, along with new tools to automate best practices, like continuous integration and continuous delivery (CI/CD).

Many organizations have embraced CI/CD, but the engineering cost and complexity of operating and maintaining secure and reliable CI/CD infrastructure is high. Incorporating best practices takes time. These are resources better spent writing software. That’s why we introduced Cloud Build, a fully-managed CI/CD platform that lets you build and test applications in the cloud–at scale.
"We found Cloud Build to be feature rich yet also easy to learn and use. We use its parallelization and caching capabilities to speed up our container builds, and leverage its container analysis API to bless our images. Its reliability has allowed us to focus our attention on other areas."
- Riley Shott, Production Engineer at Shopify
In creating Cloud Build we worked with and listened to you, software developers from every walk of life, on teams of every size. We also spent time understanding what helped our own internal engineering teams be productive. Three things consistently stood out.

Scalability: No build is ever too quick. No test suite runs too fast. As a project grows over time and new developers join the team, your CI/CD system must keep up. Built on top of Google's cloud infrastructure, with a range of CPU sizes available and pay-for-what-you-use pricing, Cloud Build can grow with your organization.

Flexibility: Software development is an increasingly complex web of ever-changing frameworks, dependencies, services, languages, and tools. Your applications are deployed across multiple clouds, on-premise resources and mobile app stores. To support your development needs, Cloud Build works with major source repositories like GitHub, GitLab, Cloud Source Repositories, and BitBucket. It also features built-in support for Docker, Maven, Gradle, Bazel, Go, and npm. An ecosystem of add-ons and the ability to bring your own tasks and toolchains as containers makes integrating into your existing developer workflow easy. You can use Cloud Build for hybrid scenarios with VPC networking and custom workers (in alpha).

Security: Security isn’t just for runtimes, it’s a full lifecycle problem that extends into every tool and pipeline you use. Cloud Build uses GCP’s world-class security and policy controls so you have control and visibility of your source and build. Cloud Build runs every build on its own VM, which reduces the risk of information leaking between builds or build errors caused by inconsistent build environments. Vulnerability scanning automatically finds known vulnerabilities in your container images (in alpha for Ubuntu, Debian, and Alpine).

As Rob Pike describes it, “Software engineering is what happens to programming when you add time and other programmers.” Striking a balance between time, quality, velocity and security is hard—but not insurmountable. The key to this balance is trust. When you can trust your tools as a safety net and your culture as a compass it’s much easier to take risks and move fast. Cloud Build makes high velocity software development safer and easier, and unleashes your team’s productivity -- try it out today!