Tag Archives: Compute

Solution guide: Migrating your dedicated game servers to Google Cloud Platform



One of the greatest challenges for game developers is accurately predicting how many players will attempt to get online at the game's launch. Over-estimate, and you risk overspending on hardware or rental commitments. Under-estimate, and players leave in frustration, never to return. Google Cloud can help you mitigate this risk while giving you access to the latest cloud technologies. Per-minute billing and automatically applied sustained use discounts take the pain out of up-front capital outlays, and spare you from playing catch-up while your player base shrinks.

The advantages for handling spiky launch-day demand are clear, but Google Cloud Platform's extensive network of regions also puts servers close to players who would otherwise suffer high latency. Game studios no longer need an expensive datacenter buildout to offer a best-in-class game experience: just request Google Compute Engine resources where they're needed, when they're needed. With new regions coming online every year, you can add game servers near your players with a couple of clicks.
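
As a rough sketch of that workflow from the command line (the instance name, zone, machine type and image below are illustrative placeholders, not recommendations from the guide):

# Bring up a dedicated game server VM in a zone near your players.
gcloud compute instances create game-server-1 \
    --zone asia-northeast1-a \
    --machine-type n1-highcpu-8 \
    --image-family debian-8 \
    --image-project debian-cloud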

We recently published our "Dedicated Game Server Migration Guide," which outlines Google Cloud Platform’s (GCP) many advantages and differentiators for gaming workloads, along with best practices for running dedicated game servers that we've learned working with leading studios and publishers. It covers the whole pipeline: creating projects, getting your builds to the cloud, distributing them to your VMs and running them, and deleting environments wholesale when they're no longer needed. Running game servers in Google Cloud has never been easier.

Google Container Engine fires up Kubernetes 1.6



Today we started to make Kubernetes 1.6 available to Google Container Engine customers. This release emphasizes significant scale improvements and additional scheduling and security options, making running Kubernetes clusters on Container Engine easier than ever before.

There were over 5,000 commits in Kubernetes 1.6 with dozens of major updates that are now available to Container Engine customers. Here are just a few highlights from this release:
  • Increase in number of supported nodes by 2.5 times: We work hard to support your workloads no matter how large they grow. Container Engine now supports cluster sizes of up to 5,000 nodes, up from 2,000, while still maintaining our strict SLO for cluster performance. Some of the world's most popular apps (such as Pokémon GO) already run on Container Engine, and this increase in scale accommodates even more of the largest workloads.
  • Fully Managed Nodes: Container Engine has always helped keep your Kubernetes master in a healthy state; we're now adding the option to fully manage your Kubernetes nodes as well. With Node Auto-Upgrade and Node Auto-Repair, you can optionally have Google automatically update your cluster to the latest version and ensure your cluster’s nodes are always operating correctly. You can read more about both features here, and see the sketch after this list.
  • General Availability of Container-Optimized OS: Container Engine was designed to be a secure and reliable way to run Kubernetes. By using Container-Optimized OS, a locked-down operating system specifically designed for running containers on Google Cloud, we provide a default experience that's more secure, performant and reliable, helping ensure your containerized workloads run well. Read more about Container-Optimized OS in this in-depth post.
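
As a minimal sketch, here's how you might create a cluster with both node-management features enabled; the cluster name and zone are placeholders, and depending on your Cloud SDK version these flags may require the gcloud beta track:

# Create a cluster whose nodes Google keeps upgraded and repaired.
gcloud container clusters create demo-cluster \
    --zone us-central1-a \
    --num-nodes 3 \
    --enable-autoupgrade \
    --enable-autorepair
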
Over the past year, Kubernetes adoption has accelerated, and we could not be more proud to host so many mission-critical applications on the platform for our customers. Some recent highlights include:

Customers

  • eBay uses Google Cloud technologies including Container Engine, Cloud Machine Learning and AI for its ShopBot, a personal shopping bot on Facebook Messenger.
  • Smyte participated in the Google Cloud startup program and protects millions of actions a day on websites and mobile applications. Smyte recently moved from self-hosted Kubernetes to Container Engine.
  • Poki, a game publisher startup, moved to Google Cloud Platform (GCP) for greater flexibility, empowered by the openness of Kubernetes. It's a theme we covered at our Google Cloud Next conference: open source technology gives customers the freedom to come and go as they choose. Read more about their decision to switch here.
“While Kubernetes did nudge us in the direction of GCP, we’re more cloud agnostic than ever because Kubernetes can live anywhere.” — Bas Moeys, Co-founder and Head of Technology at Poki

To help shape the future of Kubernetes — the core technology Container Engine is built on — join the open Kubernetes community and participate via the kubernetes-users mailing list, or chat with us in the kubernetes-users Slack channel.

We’re the first cloud to offer users the newest Kubernetes release, and with our generous 12-month free trial with $300 in credits, it’s never been simpler to get started. Try the latest release today.



Google App Engine flexible environment now available from europe-west region



A few weeks ago we shared some big news on the Google App Engine flexible environment. Today, we’re excited to announce our first new region since going GA: App Engine flexible environment is now available in the europe-west region. This release makes it easier than ever for App Engine developers to reach customers all around the world.

To get started, simply open up the Developers Console, create a new project and select App Engine. After choosing a language, you can now specify the location as europe-west. Note that once a project is created, its region cannot be changed.

You can also create your application from the command line using the latest version of the Cloud SDK:

gcloud app create --region europe-west
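
From there, deployment works exactly as before. A minimal sketch, assuming your service's app.yaml is in the current directory:

# Deploy the service described by app.yaml, then open it in a browser.
gcloud app deploy
gcloud app browse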

To learn more about the services offered in each location, as well as best practices for deploying your applications and saving your data across different regions and zones, check out our Cloud Locations and Geography and Regions pages.

Enterprise Slack apps on Google Cloud–now easier than ever



Slack recently announced a new, streamlined path to building apps, opening the door for corporate engineers to build fully featured internal integrations for companies of all sizes.

You can now make an app that supports any Slack API feature such as message buttons, threads and the Events API without having to enable app distribution. This means you can keep the app private to your team as an internal integration.
With support for the Events API in internal integrations, you can now use platforms like Google App Engine or Cloud Functions to host a Slack bot or app just for your team. Even if you're building an app for multiple teams, internal integrations let you focus on developing your app logic first and wait to implement the OAuth2 flow for distribution until you're ready.
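
As a sketch, an internal integration can be as small as one HTTP-triggered Cloud Function whose URL you paste into your Slack app's Events API settings. The function and bucket names below are placeholders, and on current SDK versions the functions commands live under the beta track:

# Deploy an HTTP-triggered function to receive Slack Events API callbacks.
gcloud beta functions deploy slackEvents \
    --stage-bucket my-staging-bucket \
    --trigger-http

# The httpsTrigger URL in the output is what you paste into your Slack
# app's event subscription settings.
gcloud beta functions describe slackEvents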

We've updated the Google Cloud Platform samples for Slack to use this new flow. With samples for multiple programming languages, including Node.js, Java, and Go, it's easier than ever to get started building Slack apps on Google Cloud Platform (GCP).

Slack bots also made an appearance at Google Cloud Next '17. Check out the video for best practices for building bots for the enterprise from Amir Shevat, head of developer relations at Slack, and Alan Ho from Google Cloud.


Questions? Comments? Come chat with us on the #bots channel in the Google Cloud Platform Slack community.

Cloud KMS GA, new partners expand encryption options



As you heard at Google Cloud Next ‘17, our Cloud Key Management Service (KMS) is now generally available. Cloud KMS makes it even easier for you to encrypt data at scale, manage secrets and protect your data the way you want, both in the cloud and on-premise. Today, we’re also announcing a number of partner options for using Customer-Supplied Encryption Keys.

Cloud KMS is now generally available.

With Cloud KMS, you can manage symmetric encryption keys in a cloud-hosted solution, whether they’re used to protect data stored in Google Cloud Platform (GCP) or another environment. You can create, use, rotate and destroy keys via our Cloud KMS API, including as part of a secret management or envelope encryption solution. Further, Cloud KMS is directly integrated with Cloud Identity and Access Management and Cloud Audit Logging for greater control over your keys.
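
A quick sketch of that lifecycle from the command line (the key ring, key and service account names are placeholders; on older Cloud SDK versions these commands may sit under the beta track):

# Create a key ring and a symmetric key, then grant a service account
# permission to encrypt and decrypt with it.
gcloud kms keyrings create my-keyring --location global
gcloud kms keys create my-key --keyring my-keyring \
    --location global --purpose encryption
gcloud kms keys add-iam-policy-binding my-key \
    --keyring my-keyring --location global \
    --member serviceAccount:app@my-project.iam.gserviceaccount.com \
    --role roles/cloudkms.cryptoKeyEncrypterDecrypter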

As we move out of beta, we’re introducing an availability SLA, so you can count on Cloud KMS for your production workloads. We’ve load tested Cloud KMS extensively, and reduced latency so that Cloud KMS can sit in the serving path of your requests.

Ravelin, a fraud detection provider, continues to use Cloud KMS to encrypt secrets stored locally, including configurations and authentication credentials used for both customer transactions and internal systems and processes. Cloud KMS allows Ravelin to easily encrypt these secrets for storage.
“Encryption is absolutely critical to any company managing their own systems, transmitting data over a network or storing sensitive data, including sensitive system configurations. Cloud KMS makes it easy to implement best practices for secret management, and its low latency allows us to use it for protecting frequently retrieved secrets. Cloud KMS gives us the cryptographic tools necessary to protect our secrets, and the features to keep encryption practical.” — Leonard Austin, CTO at Ravelin

Managing your secrets in Google Cloud


We’ve published recommendations on how to manage your secrets in Google Cloud. Most development teams have secrets they need to manage at build or run time, such as API keys. Instead of storing those secrets in source code or in metadata, in many cases we suggest you store them in a Google Cloud Storage bucket and use Cloud KMS to encrypt them at rest.
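
A minimal sketch of that pattern, assuming a hypothetical bucket and the key ring and key from your KMS setup:

# Encrypt the secret with Cloud KMS and store only the ciphertext in GCS.
gcloud kms encrypt --key my-key --keyring my-keyring --location global \
    --plaintext-file api-key.txt --ciphertext-file api-key.enc
gsutil cp api-key.enc gs://my-app-secrets/

# At run time, fetch the ciphertext and decrypt it.
gsutil cp gs://my-app-secrets/api-key.enc .
gcloud kms decrypt --key my-key --keyring my-keyring --location global \
    --ciphertext-file api-key.enc --plaintext-file api-key.txt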

Customer-Supplied Encryption Key partners


You now have several partner options for using Customer-Supplied Encryption Keys. Customer-Supplied Encryption Keys (or CSEK, available for Google Cloud Storage and Compute Engine) allow you to provide a 256-bit string, such as an AES encryption key, to protect your data at rest. Typically, customers use CSEK when they have stricter regulatory needs, or need to provide their own key material.
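
As a sketch, a CSEK is simply 32 bytes of key material, base64-encoded, wrapped in a small JSON key file that maps it to a specific resource; the disk name and file names below are illustrative:

# Generate 32 random bytes and base64-encode them as the key material.
openssl rand -base64 32 > disk-key.txt

# The key file maps a key to a specific resource URI, e.g.:
# [{"uri": "<full disk URI>", "key": "<base64 key>", "key-type": "raw"}]
gcloud compute disks create my-protected-disk \
    --zone us-central1-a \
    --csek-key-file csek-keys.json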

To simplify the use of this unique functionality, our partners Gemalto, Ionic, KeyNexus, Thales and Virtru can generate CSEK keys in the appropriate format. These partners make it easier to generate an encryption key for use with CSEK, and to associate that key with an object in Cloud Storage or a persistent disk, image or instance in Compute Engine. Each partner brings differentiated features and value to the table, which they describe in their own words below.

Gemalto
“Gemalto is dedicated to multi-cloud enterprise key management by ensuring customers have the best choices to maintain high assurance key ownership and control as they migrate operations, workloads and data to the cloud. Gemalto KeySecure has supported Client-Side Encryption with Google Cloud Storage for years, and is now extending support for Customer Supplied Encryption Keys (CSEK).” — Todd Moore, SVP of Encryption Products at Gemalto

Ionic
"We are excited to announce the first of many powerful capabilities leveraging Google's Customer Supplied Encryption Keys (CSEK). Our new Ionic Protect for Cloud Storage solution enables developers to simply and seamlessly use their own encryption keys with the full capabilities of the Ionic platform while natively leveraging Google Cloud Storage.”  Adam Ghetti, Founder and CEO of Ionic

KeyNexus
"KeyNexus helps customers supply their own keys to encrypt their most sensitive data across Google Cloud Platform as well as hundreds of other bring-your-own-key (BYOK) use cases spanning SaaS, IaaS, mobile and on-premise, via secure REST APIs. Customers choose KeyNexus as a centralized, platform-agnostic, key management solution which they can deploy in numerous highly available, scalable and low latency cloud or on-premise configurations. Using KeyNexus, customers are able to supply keys to encrypt data server-side using Customer-Supplied Encryption Keys (CSEKs) in Google Cloud Storage and Google Compute Engine"  Jeff MacMillan, CEO of KeyNexus

Thales
“Protected by FIPS 140-2 Level 3 certified hardware, the Thales nShield HSM uses strong methods to generate encryption keys based on its high-entropy random number generator. Following generation, nShield exports customer keys into the cloud for one-time use via Google’s Customer-Supplied Encryption Key functionality. Customers using Thales nShield HSMs and leveraging Google Cloud Platform can manage their encryption keys from their own environments for use in the cloud, giving them greater control over key material.” — Sol Cates, Vice President of Technical Strategy at Thales e-Security

Virtru
“Virtru offers business privacy, encryption and data protection for Google Cloud. Virtru lets you choose where your keys are hosted and how your content is encrypted. Whether for Google Cloud Storage, Compute Engine or G Suite, you can upload Virtru-generated keys to Google’s CSEK or use Virtru’s client-side encryption to protect content before upload. Keys may be stored on premise or in any public or private cloud.” — John Ackerly, Founder and CEO of Virtru

Encryption by default, and more key management options


Recall that by default, GCP encrypts customer content stored at rest, without any action required from the customer, using one or more encryption mechanisms with keys managed server-side.

Google Cloud provides you with options to choose the approach that best suits your needs. If you prefer to manage your cloud-based keys yourself, select Cloud KMS; if you’d like to manage keys with a partner or on-premise, select Customer-Supplied Encryption Keys.
Safe computing!

Google Cloud Functions: a serverless environment to build and connect cloud services



Developers rely on many cloud services to build their apps today: everything from storage and messaging services like Google Cloud Storage and Google Cloud Pub/Sub, to mobile development platforms like Firebase, to data and analytics platforms like Google Cloud Dataflow and Google BigQuery. As developers consume more cloud services from their applications, coordinating those services and ensuring they all work together seamlessly becomes increasingly complex. Last week at Google Cloud Next '17, we announced the public beta of a new capability for Google Cloud Platform (GCP) called Google Cloud Functions that allows developers to connect services together and extend their behavior with code, or to build brand new services using a completely serverless approach.

With Cloud Functions you write simple, single-purpose functions that are attached to events emitted from cloud services. Your Cloud Function is triggered when an event being watched is fired. Your code executes in a fully managed environment and can effectively connect or extend services in Google’s cloud, or services in other clouds across the internet; no need to provision any infrastructure or worry about managing servers. A function can scale from a few invocations a day to many millions of invocations without any work from you, and you only pay while your function is executing.

Asynchronous workloads like lightweight ETL, or cloud automation tasks such as triggering an application build, no longer require an always-on server that's manually connected to the event source. You simply deploy a Cloud Function bound to the event you want and you're done.
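
For example, a sketch of binding a function to new objects in a Cloud Storage bucket (the names are placeholders, and older SDKs expose the functions commands under the beta track):

# Bind a function to object-change events in a bucket; no server wiring.
gcloud beta functions deploy processUpload \
    --stage-bucket my-staging-bucket \
    --trigger-bucket my-upload-bucket
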
"Semios uses Google Cloud Functions as a critical part of our data ingestion pipeline, which asynchronously aggregates micro-climate telemetry data from our IoT network of 150,000 in-field sensors to give growers real-time insights about their orchards."
— Maysam Emadi, Data Scientist, Semios
Cloud Functions’ fine-grained nature also makes it a perfect candidate for building lightweight APIs, microservices and webhooks. HTTP endpoints are automatically configured when you deploy a function you intend to trigger over HTTP — no complicated configuration (or integration with other products) required. Simply deploy your function with an HTTP trigger, and we'll give you back a secure URL you can curl immediately.
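
A minimal sketch of that flow; the function name, staging bucket, and the project and region in the URL are placeholders:

# Deploying with an HTTP trigger returns a URL you can hit immediately.
gcloud beta functions deploy helloHttp \
    --stage-bucket my-staging-bucket \
    --trigger-http
curl https://us-central1-my-project.cloudfunctions.net/helloHttp
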
"At Vroom, we work with a number of partners to market our services and provide us with leads. Google Cloud Functions makes integration with these partners as simple as publishing a new webhook, which scales automatically with use, all without having to manage a single machine." — Benjamin Rothschild, Director of Analytics, Vroom
If you're a mobile developer using Firebase, you can now connect your Firebase app to one or more Cloud Functions by binding a Cloud Function to mutation events in the Firebase Realtime Database, events from Firebase Authentication, and even execute a Cloud Function in response to a conversion event in Firebase Analytics. You can find out more about this Firebase integration at https://firebase.google.com/features/functions.

Cloud Functions also empowers developers to quickly and easily build messaging bots and create custom actions for Google Assistant.
“At Meetup, we wanted to improve developer productivity by integrating task management with Slack. Google Cloud Functions made this integration as simple as publishing a new HTTP function. We’ve now rolled the tool out across the entire organization without ever touching a server or VM.” — Jose Rodriguez, Lead of Engineering Effectiveness, Meetup
In our commitment to openness, Cloud Functions uses only standard, off-the-shelf runtimes and doesn’t require any proprietary modules or libraries in your code: your functions will just work. In addition, the execution environment doesn't rely on a proprietary or forked operating system, which means your dependencies have native library compatibility. We currently support the Node.js runtime and have a set of open source Node.js client libraries for connecting to a wide range of GCP services.

As part of the built-in deployment pipeline we'll resolve all dependencies by running npm install for you (or npm rebuild if you provide packages that require compilation), so you don't have to worry about building for a specific environment. We also have an open source local emulator so you can build and quickly iterate on your Cloud Functions from your local machine.
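
If you'd like to iterate locally first, here's a sketch using the emulator as published on npm; run it from a directory whose index.js exports the function, and treat the exact package and command names as subject to change:

# Install and start the local Cloud Functions emulator.
npm install -g @google-cloud/functions-emulator
functions start

# Deploy and call a function locally before pushing it to the cloud.
functions deploy helloHttp --trigger-http
functions call helloHttp
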
"Node.js is continually growing across the cloud, especially when it comes to the container and serverless space. This new offering from Google, built in collaboration with the open source community, will provide even more options to the Node.js community going forward.” — Mikeal Rogers, Community Manager, Node.js Foundation
Head over to our quickstart guide to dive right in! Best of all, we've created a generous free tier to allow you to experiment, prototype and play with the product without spending a dime. You can find out more on our pricing page.

We look forward to seeing what you create with Cloud Functions. We’d love to hear your feedback on StackOverflow.

Google Cloud Platform: your Next home in the cloud



San Francisco — Today at Google Cloud Next ‘17, we’re thrilled to announce new Google Cloud Platform (GCP) products, technologies and services that will help you imagine, build and run the next generation of cloud applications on our platform.

Bring your code to App Engine, we’ll handle the rest

In 2008, we launched Google App Engine, a pioneering serverless runtime environment that lets developers build web apps, APIs and mobile backends at Google scale and speed. For nearly 10 years, some of the most innovative companies have built applications on App Engine that serve their users all over the world. Today, we’re excited to announce the general availability of a major expansion of App Engine, centered on openness and developer choice, that keeps App Engine’s original promise to developers: bring your code, we’ll handle the rest.

App Engine now supports Node.js, Ruby, Java 8, Python 2.7 or 3.5, Go 1.8, plus PHP 7.1 and .NET Core, both in beta, all backed by App Engine’s 99.95% SLA. Our managed runtimes make it easy to start with your favorite languages and use the open source libraries and packages of your choice. Need something different than what’s out of the box? Break the glass and go beyond our managed runtimes by supplying your own Docker container, which makes it simple to run any language, library or framework on App Engine.

The future of cloud is open: take your app to go by having App Engine generate a Docker container containing your app, then deploy it to any container-based environment, on or off GCP. App Engine gives developers an open platform while still providing a fully managed environment where developers focus only on code and on their users.
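
As a sketch, the "break the glass" path is a two-line app.yaml sitting next to a Dockerfile you control (the Dockerfile's contents are entirely up to you):

# Declare a custom runtime; App Engine builds and runs your container.
cat > app.yaml <<EOF
runtime: custom
env: flex
EOF
gcloud app deploy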


Cloud Functions public beta at your service

Up one level from fully managed applications, we’re launching Google Cloud Functions into public beta. Cloud Functions is a completely serverless environment to build and connect cloud services without having to manage infrastructure. It’s the smallest unit of compute offered by GCP and is able to spin up a single function and spin it back down instantly. Because of this, billing occurs only while the function is executing, metered to the nearest one hundred milliseconds.

Cloud Functions is a great way to build lightweight backends, and to extend the functionality of existing services. For example, Cloud Functions can respond to file changes in Google Cloud Storage or incoming Google Cloud Pub/Sub messages, perform lightweight data processing/ETL jobs or provide a layer of logic to respond to webhooks emitted by any event on the internet. Developers can securely invoke Cloud Functions directly over HTTP right out of the box without the need for any add-on services.

Cloud Functions is also a great option for mobile developers using Firebase, allowing them to build backends integrated with the Firebase platform. Cloud Functions for Firebase handles events emitted from the Firebase Realtime Database, Firebase Authentication and Firebase Analytics.

Growing the Google BigQuery universe: introducing BigQuery Data Transfer Service

Since our earliest days, customers have turned to Google to promote their advertising messages around the world, at a scale that was previously unimaginable. Today, those same customers want to use BigQuery, our powerful data analytics service, to better understand how users interact with those campaigns. To that end, we’ve developed deeper integration between Google's broader product family and GCP with the public beta of the BigQuery Data Transfer Service, which automates data movement from select Google applications directly into BigQuery. With BigQuery Data Transfer Service, marketing and business analysts can easily export data from AdWords, DoubleClick and YouTube directly into BigQuery, making it available for immediate analysis and visualization using the extensive set of tools in the BigQuery ecosystem.

Slashing data preparation time with Google Cloud Dataprep

In fact, our goal is to make it easy to import data into BigQuery while keeping it secure. Google Cloud Dataprep is a new serverless, browser-based service that can dramatically cut the time it takes to prepare data for analysis, which represents about 80% of the work that data scientists do. It intelligently connects to your data source, identifies data types, detects anomalies and suggests data transformations. Data scientists can then visualize their data schemas until they're happy with the proposed transformations. Dataprep then creates a data pipeline in Google Cloud Dataflow, cleans the data and exports it to BigQuery or other destinations. In other words, you can now prepare structured and unstructured data for analysis with clicks, not code. For more information on Dataprep, apply to be part of the private beta. You'll also find more news about our latest database and data analytics capabilities here and here.

Hello, (more) world

Not only are we working hard to bring you new products and capabilities, but we want your users to access them quickly and securely, wherever they may be. That’s why we’re announcing three new Google Cloud Platform regions: California, Montreal and the Netherlands. These will bring the total number of Google Cloud regions up from six today to more than 17 locations in the future. The new regions will deliver lower latency for customers in adjacent geographic areas, increased scalability and more disaster-recovery options. Like other Google Cloud regions, the new regions will feature a minimum of three zones, benefit from Google’s global, private fiber network and offer a complement of GCP services.

Supercharging our infrastructure . . .

Customers run demanding workloads on GCP, and we're constantly striving to improve the performance of our VMs. For instance, we were honored to be the first public cloud provider to run Intel Skylake, a custom Xeon processor that delivers significant enhancements for compute-heavy workloads and enables a larger range of VM memory and CPU options.

We’re also doubling the number of vCPUs you can run in an instance from 32 to 64 and now offering up to 416GB of memory, which customers have asked us for as they move large enterprise applications to Google Cloud. Meanwhile, we recently began offering GPUs, which provide substantial performance improvements to parallel workloads like training machine learning models.
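
A sketch of requesting one of the larger shapes (the instance names and zone are placeholders; the very largest memory sizes may require extended custom machine types):

# A 64-vCPU predefined shape . . .
gcloud compute instances create big-vm \
    --zone us-central1-a --machine-type n1-standard-64

# . . . or a custom shape with independently chosen vCPU and memory.
gcloud compute instances create custom-vm \
    --zone us-central1-a --custom-cpu 32 --custom-memory 120GB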

To continually unlock new energy sources, Schlumberger collects large quantities of data to build detailed subsurface earth models based on acoustic measurements, and GCP compute infrastructure has the unique characteristics Schlumberger needs to turn this data into insights. High-performance scientific computing is integral to its business, so GCP's flexibility is critical.

Schlumberger can mix and match GPUs and CPUs and dynamically create different shapes and types of virtual machines, choosing memory and storage options on demand.

"We are now leveraging the strengths offered by cloud computation stacks to bring our data processing to the next level. Ashok Belani, Executive Vice President Technology, Schlumberger

. . . without supercharging our prices

We aim to keep costs low. Today we announced Committed Use Discounts that provide up to 57% off the list price on Google Compute Engine in exchange for a one- or three-year purchase commitment. Committed Use Discounts are based on the total amount of CPU and RAM you purchase, and give you the flexibility to use different instance and machine types; they apply automatically, even if you change instance types (or sizes). There are no upfront costs with Committed Use Discounts, and they are billed monthly. What’s more, we automatically apply Sustained Use Discounts to any additional usage above a commitment.

We're also dropping prices for Compute Engine. The specific cuts vary by region. Customers in the United States will see a 5% price drop; customers in Europe will see a 4.9% drop and customers using our Tokyo region an 8% drop.

Then there’s our improved Free Tier. First, we’ve extended the free trial from 60 days to 12 months, allowing you to use your $300 credit across all GCP services and APIs, at your own pace and on your own schedule. Second, we’re introducing new Always Free products: non-expiring usage limits that you can use to test and develop applications at no cost. New additions include Compute Engine, Cloud Pub/Sub, Google Cloud Storage and Cloud Functions, bringing the number of Always Free products up to 15, and broadening the horizons for developers getting started on GCP. Visit the Google Cloud Platform Free Tier page today for further details, terms, eligibility and to sign up.

We'll be diving into all of these product announcements in much more detail in the coming days, so stay tuned!

Partnering on open source: Google and HashiCorp engineers on managing GCP infrastructure



Earlier in January, we shared the first episode of a video mini-series highlighting how the Google Cloud Graphite team is making open source software work great with the Google Cloud Platform (GCP). Today, we’re kicking off the next chapter of the series, featuring HashiCorp’s open-source DevOps tools and how to use them with GCP.

HashiCorp's open source tools simplify application delivery, helping users provision, secure and run infrastructure for any application. We kick off the series with a high-level overview, featuring Kelsey Hightower, Staff Developer Advocate for GCP, and Armon Dadgar, CTO and co-founder of HashiCorp.


Then, for our next installment, we show HashiCorp and GCP in action. Imagine a small, independent game studio working on its next title: a retro 1980s-style arcade game updated for multiplayer and playable over the web. Watch as the team engages in collaborative development, demos the game to their CEO and deploys it for public release. Along the way, we feature:
  • Vagrant, which allows developers to create repeatable development environments to be used by any member of a team without consulting operators. Vagrant can easily spin up remote VMs on Google Compute Engine and allows developers shared access to the same VM, ideal for collaborative development.
  • Packer, which, with a single configuration file, produces machine images for many target environments, including Compute Engine. The ease with which Packer images can be described and built makes them an ideal fit with DevOps concepts such as immutable infrastructure and continuous delivery.
  • Terraform, which helps operators safely and predictably create, modify and destroy production infrastructure. It codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed and versioned. Operators can thus manage GCP resources spanning many products, which is key when provisioning scalable production infrastructure; see the sketch after this list.
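
To make that concrete, here's a minimal sketch of a Terraform configuration for a single Compute Engine instance. The project, names and image are placeholders, and the exact resource schema depends on your google provider version:

# Write a minimal config, then run the standard Terraform workflow.
cat > main.tf <<EOF
provider "google" {
  project = "my-project"
  region  = "us-central1"
}

resource "google_compute_instance" "game_server" {
  name         = "game-server-1"
  machine_type = "n1-standard-1"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-8"
    }
  }

  network_interface {
    network = "default"
    access_config {}   # ephemeral external IP
  }
}
EOF

terraform init
terraform plan
terraform apply
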
Join us on YouTube to watch other episodes that will cover topics including using machine images to deploy or using infrastructure as code to manage resources. Follow Google Cloud on YouTube, or @GoogleCloud on Twitter to find out when new videos are published. And stay tuned for more blog posts and videos about work we’re doing with open-source providers like Puppet, Chef, Cloud Foundry, Red Hat, SaltStack and others.

Incident management at Google — adventures in SRE-land



Have you ever wondered what happens at Google when something goes wrong? Our industry is fond of using colorful metaphors such as “putting out fires” to describe what we do.
Of course, unlike actual firefighters, we don't normally face risk to life and limb when handling incidents. Still, despite the imperfect metaphor, Google Site Reliability Engineers (SREs) have a lot in common with first responders in other fields.

Like these other first responders, SREs at Google regularly practice emergency response, honing the skills, tools, techniques and attitude required to quickly and effectively deal with the problem at hand.

In emergency services, and at Google, when something goes wrong, it's called an “incident.”

This is the story of my first “incident” as a Google SRE.

Prologue: preparation


For the past several months, I’ve been on a Mission Control rotation with the Google Compute Engine SRE team. I did one week of general SRE training. I learned about Compute Engine through weekly peer training sessions, and by taking on project work. I participated in weekly “Wheel of Misfortune” sessions, where we're given a typical on-call problem and try to solve it. I shadowed actual on-callers, helping them respond to problems. I was secondary on-call, assisting the primary with urgent issues, and handling less urgent issues independently.

Sooner or later, after all the preparation, it’s time to be at the sharp end. Primary on-call. The first responder.

Editor's Note: Chapter 28 “Accelerating SREs to On-Call and Beyond” in Site Reliability Engineering goes into detail about how we prepare new SREs to be ready to be first responders.

Going on-call

There's a lot more to being an SRE than being on-call. On-call is, by design, a minority of what Site Reliability Engineers (SREs) do, but it's also critical. Not only because someone needs to respond when things go wrong, but because the experience of being on-call informs many other things we do as SREs.

During my first on-call shifts, our alerting system saw fit to page1 me twice, and two other problems were escalated to me by other people. With each page, I felt a hit of adrenaline. I wondered, "Can I handle this? What if I can’t?" But then I started to work the problem in front of me, like I was trained to, and I remembered that I don’t need to know everything: there are other people I can call on, and they will answer. I may be on point, but I’m not alone.

Editor’s Note: Chapter 11 “Being On-Call” in Site Reliability Engineering has lots of advice on how to organize on-call duties in a way that allows people to be effective over the long term.

It’s an incident!

Three of the pages I received were minor. The fourth was, shall we say . . . more interesting.

Another Google engineer using Compute Engine for their service had a test automation failure, and upon investigation noticed something unusual with a few of their instances. They notified the development team’s primary on-call, Parya, and she brought me into the loop. I reached out to my more experienced secondary, Benson, and the three of us started to investigate, along with others from the development team who were looped in. Relatively quickly we determined it was a genuine problem. Having no reason to believe that the impact was limited to the single internal customer who reported the issue, we declared an incident.

What does declaring an incident mean? In principle it means that an issue is of sufficient potential impact, scope and complexity that it will require a coordinated effort with well defined roles to manage it effectively. At some point, everything you see on the summary page of the Google Cloud Status Dashboard was declared an incident by someone at Google. In practice, declaring an incident at Google means creating a new incident in our internal incident management tool.

As part of my on-call training, I was trained on the principles behind Google’s incident management protocol, and the internal tool that we use to facilitate incident response. The incident management protocol defines roles and responsibilities for the individuals involved. Earlier I asserted that Google SREs have a lot in common with other first responders. Not surprisingly, our incident management process was inspired by, and is similar to, well established incident command protocols used in other forms of emergency response.

My role was Incident Commander. Less than seven minutes after I declared the incident, a member of our support team took on the External Communications role. In this particular incident, we did not declare any other formal roles, but in retrospect, Parya was the Operations Lead; she led the efforts to root-cause the issue, pulling in others as needed. Benson was the Assistant Incident Commander, as I asked him a series of questions of the form “I think we should do X, Y and Z. Does that sound reasonable to you?”

One of the keys to effective incident response is clear communication between incident responders, and others who may be affected by the incident. Part of that equation is the incident management tool itself, which is a central place that Googlers can go to know about any ongoing incidents with Google services. The tool then directs Googlers to additional relevant resources, such as an issue in our issue-tracking database that contains more details, or the communications channels being used to coordinate the incident response.

Editor’s Note: Chapters 12, 13 and 14 of Site Reliability Engineering discuss effective troubleshooting, emergency response and managing incidents, respectively.

The rollback — an SRE’s fire extinguisher


While some of us worked to understand the scope of the issue, others looked for the proximate and root causes so we could take action to mitigate the incident. The scope was determined to be relatively limited, and the cause was tracked down to a particular change included in a release that was currently being rolled out.

This is quite typical. The majority of problems in production systems are caused by changing something: a new configuration, a new binary, or a service you depend on doing one of those things. There are two best practices that help in this very common situation.

First, all non-emergency changes should use a progressive rollout, which simply means don’t change everything at once. This gives you the time to notice problems, such as the one described here, before they become big problems affecting large numbers of customers.

Second, all rollouts should have a well understood and well tested rollback mechanism. This means that once you understand which change is responsible for the problem, you have an “undo” button you can press to restore service.

Keeping your problems small using a progressive rollout, and then mitigating them quickly via a trusted rollback mechanism are two powerful tools in the quest to meet your Service Level Objectives (SLOs).

This particular incident followed this pattern. We caught the problem while it was small, and then were able to mitigate it quickly via a rollback.

Editor’s Note: Chapter 36 “A Collection of Best Practices for Production Services” in Site Reliability Engineering talks more about these, and other, best practices.

Epilogue: the postmortem


With the rollback complete, and the problem mitigated, I declared the incident “closed.” At this point, the incident management tool helpfully created a postmortem document for the incident responders to collaborate on. Taking our firefighting analogy to its logical conclusion, this is analogous to the part where the fire marshal analyzes the fire, and the response to the fire, to see how similar fires could be prevented in the future, or handled more effectively.

Google has a blameless postmortem culture. We believe that when something goes wrong, you should not look for someone to blame and punish. Chances are the people in the story were well intentioned, competent and doing the best they could with the information they had at the time. If you want to make lasting change, and avoid having similar problems in the future, you need to look to how you can improve the systems, tools and processes around the people, such that a similar problem simply can’t happen again.

Despite the relatively limited impact of the incident, and the relatively subtle nature of the bug, the postmortem identified nine specific follow-up actions that could potentially avoid the problem in the future, or allow us to detect and mitigate it faster if a similar problem occurs. These nine issues were all filed in our bug tracking database, with owners assigned, so they'll be considered, researched and followed up on in the future.

The follow-up actions are not the only outcome of the postmortem. Since every incident at Google has a postmortem, and since we use a common template for our postmortem documents, we can perform analysis of overall trends. For example, this is how we know that a significant fraction of incidents at Google come from configuration changes. (Remember this the next time someone says “but it’s just a config change” when trying to convince you that it’s a good idea to push it out late on the Friday before a long weekend . . .)

Postmortems are also shared within the teams involved. On the Compute Engine team, for example, we have a weekly incident review meeting, where incident responders present their postmortem to a broader group of SREs and developers who work on Compute Engine. This helps identify additional follow-up items that may have been overlooked, and shares the lessons learned with the broader team, making everyone better at thinking about reliability through these case studies. It's also a very strong way to reinforce Google’s blameless postmortem culture. I recall one of these meetings where the person presenting the postmortem attempted to take blame for the problem. The person running the meeting said, “While I appreciate your willingness to fall on your sword, we don’t do that here.”

The next time you read the phrase “We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence” on our status page, I hope you'll remember this story. Having experienced firsthand the way we follow up on incidents at Google, I can assure you that it's not an empty promise.

Editor's Note: Chapter 15, “Postmortem Culture: Learning from Failure” in Site Reliability Engineering discusses postmortem culture in depth.



1 We don’t actually use pagers anymore of course, but we still call it “getting paged” no matter what device or communications channel is used.

Delivering a better platform for your SQL Server Enterprise workloads



Our goal at Google Cloud Platform (GCP) is to be the best enterprise cloud environment. Throughout 2016, we worked hard to ensure that Windows developers and IT administrators would feel right at home when they came to GCP: whether it’s building an ASP.NET application with their favorite tools like Visual Studio and PowerShell, or deploying the latest version of Windows Server onto Google Compute Engine.

Continuing our work in providing great infrastructure for enterprises running Windows, we’re pleased to announce pre-configured images for Microsoft SQL Server Enterprise and Windows Server Core on Compute Engine. High availability and disaster recovery are top of mind for our larger customers, so we’re also announcing support for SQL Server AlwaysOn Availability Groups and persistent disk snapshots integrated with Volume Shadow Copy Service (VSS) on Windows Server. Finally, all of our Windows Server images are now enabled with Windows Remote Management support, including our Windows Server Core 2016 and 2012 R2 images.

SQL Server Enterprise Edition images on GCE


You can now launch Compute Engine VMs with Microsoft SQL Server Enterprise Edition pre-installed, and pay by the minute for SQL Server Enterprise and Windows Server licenses. Customers can also choose to bring their own licenses for SQL Server Enterprise.

We now support pre-configured images for the following versions in Beta:

  • SQL Server Enterprise 2016
  • SQL Server Enterprise 2014
  • SQL Server Enterprise 2012 
Supported SQL Server images available on Compute Engine

SQL Server Enterprise targets mission-critical workloads by supporting more cores, higher memory and important enterprise features, including:

  • In-memory tables and indexes
  • Row-level security and encryption for data at rest or in motion
  • Multiple read-only replicas for integrated HA/DR and read scale-out
  • Business intelligence and rich visualizations on all platforms, including mobile
  • In-database advanced analytics with R


Combined with Google’s world-class infrastructure, SQL Server instances running on Compute Engine benefit from price-to-performance advantages, highly customizable VM sizes and state-of-the-art networking and security capabilities. With automatic sustained use discounts, and with the prospect of retiring hardware and its associated maintenance, customers can achieve total costs lower than those of other cloud providers.

To get started, learn how to create SQL Server instances easily on Google Compute Engine.
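
As a sketch of what that looks like from the command line (the instance name, zone and machine type are placeholders; the image family below follows the windows-sql-cloud project's published naming, but check the image list for the current set):

# See which SQL Server images Google publishes, then launch one.
gcloud compute images list --project windows-sql-cloud

gcloud compute instances create sql-ent-vm \
    --zone us-central1-a \
    --machine-type n1-highmem-8 \
    --image-project windows-sql-cloud \
    --image-family sql-ent-2016-win-2016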



High availability and disaster recovery for SQL Server VMs


Mission-critical SQL Server workloads require support for high availability and disaster recovery. To achieve this, GCP supports Windows Server Failover Clustering (WSFC) and SQL Server AlwaysOn Availability Groups. AlwaysOn Availability Groups is SQL Server’s flagship HA/DR solution, allowing you to configure replicas for automatic failover in case of failure. These replicas can be readable, allowing you to offload read workloads and backups.

Compute Engine users can now configure AlwaysOn Availability Groups, including replicas on VMs in separate, isolated zones, as described in these instructions.
A highly available SQL Server reference architecture using Windows Server Failover Clustering and SQL Server AlwaysOn Availability Groups


Better backups with VSS-integrated persistent disk snapshots for Windows VMs


Being able to take snapshots in coordination with Volume Shadow Copy Service ensures that you get application-consistent snapshots for persistent disks attached to an instance running Windows, without having to shut it down. This feature is useful when you want to take a consistent backup of VSS-enabled applications like SQL Server and Exchange Server without affecting the workload running on the VMs.

To get started with VSS-enabled persistent disk snapshots, select Snapshots under the Compute Engine section of the Cloud Console. There you'll see a new check-box on the disk snapshot creation page that lets you specify whether a snapshot should be VSS-enabled.

This feature can also be invoked via the gcloud SDK and API, following these instructions.
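
For reference, the CLI path is a single flag on the snapshot command (the disk name and zone are placeholders):

# Take an application-consistent (VSS) snapshot of a Windows disk.
gcloud compute disks snapshot my-windows-disk \
    --zone us-central1-a \
    --guest-flush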

Looking ahead


GCP’s expanded support for SQL Server images and high availability are our latest efforts to improve Windows support on Compute Engine, and to build a cloud environment for enterprise Windows that leads the industry. Last year we expanded our list of pre-configured images to include SQL Server Standard, SQL Server Web and Windows Server 2016, and announced comprehensive .NET developer solutions, including a .NET client library for all GCP APIs through NuGet. We have lots more in store for the rest of 2017!

For more resources on Windows Server and Microsoft SQL Server on GCP, check out cloud.google.com/windows and cloud.google.com/sql-server. And for hands-on training on how to deploy and manage Windows and SQL Server workloads on GCP, come to the GCP NEXT ‘17 Windows Bootcamp. Finally, if you need help migrating your Windows workloads, don’t hesitate to contact us. We’re eager to hear your feedback!