Category Archives: Google Cloud Platform Blog

Product updates, customer stories, and tips and tricks on Google Cloud Platform

Guest post: Using GCP for massive drug discovery virtual screening



[Editor’s note: Today we hear from Boston, MA-based Silicon Therapeutics, which applies computational methods to complex biochemical problems relevant to human biology.]

As an integrated computational drug discovery firm, we recently deployed our INSITE Screening platform on Google Cloud Platform (GCP) to analyze over 10 million commercially available molecular compounds as potential starting materials for next-generation medicines. In one week, we performed over 500 million docking computations to evaluate how a protein responds to a given molecule. Each computation involved a docking program that predicted the preferred orientation of a small molecule to a protein and the associated energetics, so we could assess whether it would bind and alter the function of the target protein.

With a combination of Google Compute Engine standard and Preemptible VMs, we used up to 16,000 cores, for a total of 3 million core-hours and a cost of about $30,000. While this might sound like a lot of time and money, it's a lot less expensive and a lot faster than experimentally screening all compounds. Using a physics-based approach such as our INSITE platform is much more computationally expensive than some other computational screening approaches, but it allows us to find novel binders without the use of any prior information about active compounds (this particular target has no drug-like compounds known to bind). In a final stage of the calculations we performed all-atom molecular dynamics (MD) simulations on the top 1,000 molecules to determine which ones to purchase and experimentally assay for activity.

The bottom line: We successfully completed the screen using our INSITE platform on GCP and found several molecules that have recently been experimentally verified to have on-target and cell-based activity.

We chose to run this high-performance computing (HPC) job on GCP over other public cloud providers for a number of reasons:
  • Availability of high-performance compute infrastructure. Compute Engine has a good inventory of high-performance processors that can be configured with large numbers of cores and large amounts of memory. It also offers GPUs, a great fit for some of our computations, such as molecular dynamics and free energy calculations. SSDs made a big difference in performance, as our total I/O for this screen exceeded 40 TB of raw data. Fast connectivity between the front-end and the compute nodes was also a big factor, as the front-end disk was NFS-mounted on the compute nodes.
  • Support for industry standard tools. As a startup, we value the ability to run our workloads wherever we see fit. Our priorities can change rapidly based on project challenges (chemistry and biology), competition, opportunities and the availability of compute resources. Our INSITE platform is built on a combination of open-source and proprietary in-house software, so portability and repeatability across in-house and public clouds is essential.
  • An attractive pricing model. Preemptible VMs are a great combination of cost-effective and predictable, offering up to 80% off standard instances, with no bidding and no surprises. That means we don't have to worry about jobs being killed due to a bidding war, which can create significant delays in completing our screens and require unnecessary human overhead to manage the jobs.
We initialized multiple clusters for the screening; specifically, our cluster’s front-end consisted of three full-priced n1-highmem-32 VM instances with 208GB of RAM that ran the queuing system, and that connected to a 2TB SSD NFS filestore that housed the compound library. Each of these front-end nodes then spawned up to 128 compute nodes configured as n1-highcpu-32 Preemptible VMs, each with 28.8GB of memory. Those compute nodes performed the actual molecular compound screens, and wrote their results back to the filestore. Preemptible VMs run for a maximum of 24 hours; when that time elapsed, the front-end nodes drained any jobs remaining on the compute nodes and re-spawned a new set of nodes until all 10 million compounds had been successfully run.

To manage compute jobs, we enlisted the help of two popular open-source tools: Slurm, a workload manager used by 60% of the world’s TOP500 clusters, and ElastiCluster, which provides a command-line tool to create, manage and set up compute clusters hosted on a variety of cloud infrastructures. Using these open-source packages is economical, provides the lion’s share of the functionality of paid software solutions and ensures we can run our workloads in-house or elsewhere.
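To make this concrete, here's a hedged sketch of how a screen like this can be driven once ElastiCluster and Slurm are in place. The cluster template and batch script names below are placeholders, not our production setup:

# Start a cluster from an ElastiCluster configuration template
$ elasticluster start slurm-screen

# Log in to the cluster front-end
$ elasticluster ssh slurm-screen

# Submit the docking work to Slurm as a job array, one task per compound batch
$ sbatch --array=1-10000 dock_batch.sh
$ squeue    # monitor queue progress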

More compute = better results

But ultimately, the biggest benefit of using GCP was being able to more thoroughly screen compounds than we could have done with in-house resources. The target protein in this particular study was highly flexible, and having access to massive amounts of compute power allowed us to more accurately model the underlying physics of the system by accounting for protein flexibility. This yielded more active compounds than we would have found without the GCP resources.

The reality is that all proteins are flexible, and undergo some form of induced fit upon ligand binding, so treating protein flexibility is always important in virtual screening if you want the best results. Most molecular docking programs only account for ligand flexibility, so if the receptor structure is not quite right then active compounds might not fit and therefore be missed, no matter how good the docking program is. Our INSITE screening platform incorporates protein flexibility in a novel way that can greatly improve the hit rate in virtual screening, even as it requires a lot of computational resources when screening millions of commercially available compounds.

Example of the dynamic nature of a protein target (Interleukin-18, IL-18)
From the initial 10 million compounds, we prioritized 250 promising compounds for experimental validation in our lab. As a small company, we don't have the capabilities to experimentally screen millions of compounds, and there's no need to do so with an accurate virtual screening approach like the one in our INSITE platform. We're excited to report that at least five of these compounds have shown activity in human cells, suggesting them as promising starting points for new medicines. To our knowledge, there are no drug-like small molecule activators of this important and challenging immuno-oncology target.

To learn more about the science at Silicon Therapeutics, please visit our website. And if you’re an engineer with expertise in high performance computing, GPUs and/or molecular simulations, be sure to visit our job listings.


Google Cloud Platform now open in London



Starting today, Google Cloud Platform (GCP) customers can use the new region in London (europe-west2) to run applications and store data in London. London is our tenth region and joins our existing European region in Belgium. Future European regions include Frankfurt, the Netherlands and Finland.
Incredible user experiences hinge on performant infrastructure. GCP customers throughout the British Isles and Western Europe will see significant reductions in latency when they run their workloads in the London region. In cities like London, Dublin, Edinburgh and Amsterdam, our performance testing shows 40%-82% reductions in round-trip time latency when serving customers from London compared with the Belgium region.

We’ve launched London with three zones and an initial set of services; see the London region page for the current list.
The London region puts control over how to deploy resources directly in the hands of GCP customers, giving them a choice, for some GCP services, of where to run their applications and store their data. When a customer signs up for GCP services, they have three different options, depending on the service:
  1. Regional: Run applications and store data in a specific region, e.g., London, Tokyo, Iowa, etc.
  2. Multi-regional: Distribute applications and storage across two or more cloud regions on a given continent, e.g., Americas, Asia or Europe.
  3. Global: Distribute applications and store data globally across our entire global network for optimal performance and redundancy.
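For example, deploying into the new region is just a matter of naming it. Here's a hedged illustration with placeholder resource names, assuming the services you need are available in London:

# Create a VM in one of the London zones
$ gcloud compute instances create my-instance --zone europe-west2-a

# Create a regional Cloud Storage bucket in London
$ gsutil mb -l europe-west2 gs://my-london-bucket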
In addition, we've worked diligently over the last decade to help customers directly address EU data protection requirements. Most recently, Google announced a commitment to GDPR compliance across GCP. The General Data Protection Regulation (GDPR), which takes effect on May 25, 2018, is the most significant piece of European privacy legislation in the last 20 years.
"Google’s decision to choose London for its latest Google Cloud Region is another vote of confidence in our world-leading digital economy and proof Britain is open for business. It's great, but not surprising, to hear they've picked the UK because of the huge demand for this type of service from the nation's firms. Earlier this week the Digital Evolution Index named us among the most innovative digital countries in the world and there has been a record £5.6bn investment in tech in London in the past six months.
Karen Bradley, Secretary of State for Digital, Culture, Media and Sport
"At WP Engine, we look forward to extending our digital experience platform to an even broader set of our 10,000 European customers who want to be hosted on Google Cloud Platform based in the London region. We are excited about bringing reduced latency benefits from the ability to store and process data in London to our UK customers."  
Jason Cohen, Founder and CTO, WP Engine

"The Telegraph benefits greatly from Google Cloud’s global scale and is pleased to see continued investment from Google Cloud in the UK. We look forward to working with them closely as they expand their business in the UK and Europe."  
 Toby Wright, CTO, The Telegraph
"Google Cloud enables Revolut to try new ideas and stay agile while providing secure, reliable services for our customers at scale."  
Vladyslav Yatsenko, Co-founder & CTO, Revolut

For the latest on the terms of availability for services from this new region as well as additional regions and services, visit our London region page or locations page. For guidance on how to build and create highly available applications, take a look at our zones and regions page. Give us a shout to request early access to new regions and help us prioritize what we build next.

We’re excited to see what you’ll build on top of the new London region!

Container Engine now runs Kubernetes 1.7 to drive enterprise-ready secure hybrid workloads



Just over a week ago Google led the most recent open source release of Kubernetes 1.7, and today, that version is available on Container Engine, Google Cloud Platform’s (GCP) managed container service. Container Engine is one of the first commercial Kubernetes offerings running the latest 1.7 release, and includes differentiated features for enterprise security, extensibility, hybrid networking and developer efficiency. Let’s take a look at what’s new in Container Engine.

Enterprise security


Container Engine is designed with enterprise security in mind. By default, Container Engine clusters run a minimal, Google-curated Container-Optimized OS (COS) to ensure you don’t have to worry about OS vulnerabilities. On top of that, a team of Google Site Reliability Engineers continuously monitors and manages Container Engine clusters, so you don’t have to. Now, Container Engine adds several new security enhancements:

  • Starting with this release, kubelet only has access to the objects it needs to know about. The Node authorizer beta restricts each kubelet’s API access to resources (such as secrets) belonging to its scheduled pods. This feature increases the protection of a cluster from a compromised or untrusted node.
  • Network isolation can be an important extra boundary for sensitive workloads. The Kubernetes NetworkPolicy API allows users to control which pods can communicate with each other, providing defense-in-depth and improving secure multi-tenancy; a sample policy follows this list. Policy enforcement can now be enabled in alpha clusters.
  • HTTP re-encryption through Google Cloud Load Balancing (GCLB) allows customers to use HTTPS from the GCLB to their service backends. This often-requested feature gives customers the peace of mind of knowing that their data is fully encrypted in transit even after it enters Google’s global network.
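As a hedged illustration of the NetworkPolicy API mentioned above (the names and labels are hypothetical, not a Google-provided manifest), the following policy allows only frontend pods to reach the database pods:

$ kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-only
spec:
  podSelector:
    matchLabels:
      app: db
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
EOF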

Together the above features improve workload isolation within a cluster, which is a frequently requested security feature in Kubernetes. Node Authorizer and NetworkPolicy can be combined with the existing RBAC control in Container Engine to improve the foundations of multi-tenancy:
  • Network isolation between Pods (network policy)
  • Resource isolation between Nodes (node authorizer)
  • Centralized control over cluster resources (RBAC)

Enterprise and hybrid networks


Perhaps the features most awaited by our enterprise users are networking support for hybrid cloud and VPN with Container Engine. New in this release:
  • GA support for all private (RFC 1918) IP addresses, allowing users to create clusters and access resources in any private IP range and extending the ability to use Container Engine clusters with existing networks.
  • Exposing services by internal load balancing is now beta, allowing Kubernetes and non-Kubernetes services to access one another on a private network¹ (see the sample manifest after this list).
  • Source IP preservation is now generally available, allowing applications to be fully aware of client IP addresses for services exposed through Kubernetes.
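To sketch what the internal load balancing beta looks like in practice, a Service can be marked internal with a GCP-specific annotation. The service name, selector and ports here are hypothetical, and the exact annotation value may vary by release:

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: internal-api
  annotations:
    cloud.google.com/load-balancer-type: "Internal"
spec:
  type: LoadBalancer
  selector:
    app: api
  ports:
  - port: 80
    targetPort: 8080
EOF

The service then receives a private address on the cluster's network rather than a public IP.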

Enterprise extensibility

As more enterprises use Container Engine, we're making a major investment to improve extensibility. We heard feedback that customers want to offer custom Kubernetes-style APIs in their clusters.

API Aggregation, launching today in beta on Container Engine, enables you to extend the Kubernetes API with custom APIs. For example, you can now add existing API solutions such as service catalog, or build your own in the future.

Users also want to incorporate custom business logic and third-party solutions into their Container Engine clusters. So we’re introducing Dynamic Admission Control in alpha clusters, providing two ways to add business logic to your cluster:
  • Initializers can modify Kubernetes objects as they are created. For example, you can use an initializer to add Istio capability to a Container Engine alpha cluster, by injecting an Istio sidecar container in every Pod deployed.
  • Webhooks enable you to validate enterprise policy. For example, you can verify that containers being deployed pass your enterprise security audits.
As part of our plans to improve extensibility for enterprises, we're replacing the Third Party Resource (TPR) API with the improved Custom Resource Definition (CRD) API. CRDs are a lightweight way to store structured metadata in Kubernetes, which make it easy to interact with custom controllers via kubectl. If you use the TPR beta feature, please plan to migrate to CRD before upgrading to the 1.8 release.
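As a hedged example of the CRD API (the group and kind below are the illustrative ones from the upstream Kubernetes documentation, not a Container Engine-specific resource):

$ kubectl apply -f - <<EOF
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: crontabs.stable.example.com
spec:
  group: stable.example.com
  version: v1
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
EOF

# Custom objects can then be managed like any built-in resource
$ kubectl get crontabs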

Workload diversity


Container Engine now enhances your ability to run stateful workloads, like databases and key-value stores such as ZooKeeper, with a new automated application update capability. You can:
  • Select from a range of StatefulSet update strategies (beta), including rolling updates (see the sketch after this list)
  • Optimize roll-out speed with parallel or ordered pod provisioning, particularly useful for applications such as Kafka
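Here's a hedged sketch of those two knobs on a StatefulSet; the ZooKeeper image tag and resource names are illustrative, and a headless Service named "zk" is assumed to exist:

$ kubectl apply -f - <<EOF
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: zk
spec:
  serviceName: zk
  replicas: 3
  podManagementPolicy: Parallel    # provision pods in parallel rather than one by one
  updateStrategy:
    type: RollingUpdate            # beta rolling-update strategy
  template:
    metadata:
      labels:
        app: zk
    spec:
      containers:
      - name: zk
        image: zookeeper:3.4
        ports:
        - containerPort: 2181
EOF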
A popular workload on Google Cloud and Container Engine is training machine learning models for better predictive analytics. Many of you have requested GPUs to speed up training time, so we’ve updated Container Engine to support NVIDIA K80 GPUs in alpha clusters for experimentation with this exciting feature. We’ll support additional GPUs in the future.

Developer efficiency


When developers don’t have to worry about infrastructure, they can spend more time building applications. Kubernetes provides building blocks to de-couple infrastructure and application management, and Container Engine builds on that foundation with best-in-class automation features.

We’ve automated large parts of maintaining the health of the cluster, with auto-repair and auto-upgrade of nodes.
  • Auto-repair beta keeps your cluster healthy by proactively monitoring for unhealthy nodes and repairing them automatically, without developer involvement.
  • In this release, Container Engine’s auto-upgrade beta capability incorporates Pod Disruption Budgets at the node layer, making upgrades to infrastructure and application controllers predictable and safer.
Container Engine also offers cluster- and pod-level auto-scaling so applications can respond to user demand without manual intervention. This release introduces several GCP-optimized enhancements to cluster autoscaling (see the provisioning sketch after this list):
  • Support for scaling node pools to 0 or 1, for when you don’t need capacity
  • Price-based expander for auto-scaling in the most cost-effective way
  • Balanced scale-out of similar node groups, useful for clusters that span multiple zones
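As a rough sketch of how these features are enabled together (cluster and pool names are placeholders, and exact flags vary by gcloud release):

# Create a node pool with auto-repair, auto-upgrade and autoscaling enabled
$ gcloud beta container node-pools create my-pool --cluster my-cluster \
    --enable-autorepair --enable-autoupgrade \
    --enable-autoscaling --min-nodes 0 --max-nodes 10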

The combination of auto-repair, auto-upgrades and cluster autoscaling in Container Engine enables application developers to deploy and scale their apps without being cluster admins.

We’ve also updated the Container Engine UI to assist in debugging and troubleshooting by including detailed workload-related views. For each workload, we show the type (DaemonSet, Deployment, StatefulSet, etc.), running status, namespace and cluster. You can also debug each pod and view annotations, labels, the number of replicas and status, etc. All views are cross-cluster so if you're using multiple clusters, these views allow you to focus on your workloads, no matter where they run. In addition, we also include load balancing and configuration views with deep links to GCP networking, storage and compute. This new UI will be rolling out in the coming week.

Container Engine everywhere


Google Cloud is enabling a shift in enterprise computing: from local to global, from days to seconds, and from proprietary to open. The benefits of this model are becoming clear, as exemplified by Container Engine, which saw more than 10x growth last year.

To keep up with demand, we're expanding our global capacity with new Container Engine clusters in our latest GCP regions:
  • Sydney (australia-southeast1)
  • Singapore (asia-southeast1)
  • Oregon (us-west1)
  • London (europe-west2)

These new regions join the half dozen others from Iowa to Belgium to Taiwan where Container Engine clusters are already up and running.

This blog post highlighted some of the new features available in Container Engine. You can find the complete list of new features in the Container Engine release notes.

The rapid adoption of Container Engine and its technology is translating into real customer impact. Here are a few recent stories that highlight the benefits companies are seeing:

  • BQ, one of the leading technology companies in Europe that designs and develops consumer electronics, was able to scale quickly from 15 to 350 services while reducing its cloud hosting costs by approximately 60% through better utilization and use of Preemptible VMs on Container Engine. Read the full story here.
  • Meetup, the social media networking platform, switched from a monolithic application in on-premises data centers to an agile microservices architecture in a multi-cloud environment with the help of Container Engine. This gave its engineering teams autonomy to work on features and develop roadmaps that are independent from other teams, translating into faster release schedules, greater creativity and new functionality. Read the case study here.
  • Loot Crate, a leader in fan subscription boxes, launched a new offering on Container Engine to quickly get their Rails app production ready and able to scale with demand and zero downtime deployments. Read how it built its continuous deployment pipeline with Jenkins in this post.
At Google Cloud we’re really proud of our compute infrastructure, but what really makes it valuable is the services that run on top. Google creates game-changing services on top of world-class infrastructure and tooling. With Kubernetes and Container Engine, Google Cloud makes these innovations available to developers everywhere.

GCP is the first cloud to offer a fully managed way to try the newest Kubernetes release, and with our generous 12-month free trial of $300 in credits, there’s no excuse not to try it today.

Thanks for your feedback and support. Keep the conversation going and connect with us on the Container Engine Slack channel.



¹ Support for accessing Internal Load Balancers over Cloud VPN is currently in alpha; customers can apply for access here.


Guest post: Loot Crate unboxes Google Container Engine for new Sports Crate venture



[Editor’s note: Gamers and superfans know Loot Crate, which delivers boxes of themed swag to 650,000 subscribers every month. Loot Crate built its back-end on Heroku, but for its next venture, Sports Crate, the company decided to containerize its Rails app with Google Container Engine, and added continuous deployment with Jenkins. Read on to learn how they did it.]

Founded in 2012, Loot Crate is the worldwide leader in fan subscription boxes, partnering with entertainment, gaming and pop culture creators to deliver monthly themed crates, produce interactive experiences and digital content and film original video productions. In our first five years, we’ve delivered over 14 million crates to fans in 35 territories across the globe.
In early 2017 we were tasked with launching an offering to Major League Baseball fans called Sports Crate. There were only a couple of months until the 2017 MLB season started on April 2nd, so we needed the site to be up and capturing emails from interested parties as fast as possible. Other items on our wish list included the ability to scale the site as traffic increased, automated zero-downtime deployments, effective secret management and the benefits of Docker images. Our other Loot Crate properties are built on Heroku, but for Sports Crate, we decided to try Container Engine, which we suspected would let our app scale better during peak traffic, let us manage our resources with a single Google login and better manage our costs.


Continuous deployment with Jenkins

Our goal was to be able to successfully deploy an application to Container Engine with a simple git push command. We created an auto-scaling, dual-zone Kubernetes cluster on Container Engine, and tackled how to do automated deployments to the cluster. After a lot of research and a conversation with Google Cloud Solutions Architect Vic Iglesias, we decided to go with Jenkins Multibranch Pipelines. We followed this guide on continuous deployment on Kubernetes and soon had a working Jenkins deployment running in our cluster ready to handle deploys.

Our next task was to create a Dockerfile of our Rails app to deploy to Container Engine. To speed up build time, we created our own base image with Ruby and our gems already installed, as well as a rake task to precompile assets and upload them to Google Cloud Storage when Jenkins builds the Docker image.

Dockerfile in hand, we set up the Jenkins Pipeline to build the Docker image, push it to Google Container Registry and deploy Kubernetes and its services to our environment. We put a Jenkinsfile in our GitHub repo that uses a switch statement based on the GitHub branch name to choose which Kubernetes namespace to deploy to. (We have three QA environments, a staging environment and a production environment.)

The Jenkinsfile checks out our code from GitHub, builds the Docker image, pushes the image to Container Registry, runs a Kubernetes job that performs any database migrations (checking for success or failure) and runs tests. It then deploys the updated Docker image to Container Engine and reports the status of the deploy to Slack. The entire process takes under 3 minutes.
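For the curious, the pipeline boils down to a handful of commands; this is a hedged sketch with placeholder project, image and deployment names, not our actual Jenkinsfile:

# Build and push the image to Container Registry
$ docker build -t gcr.io/my-project/sports-crate:$GIT_COMMIT .
$ gcloud docker -- push gcr.io/my-project/sports-crate:$GIT_COMMIT

# Roll the new image out to the right namespace and wait for completion
$ kubectl --namespace=staging set image deployment/web web=gcr.io/my-project/sports-crate:$GIT_COMMIT
$ kubectl --namespace=staging rollout status deployment/web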

Improving secret management in the local development environment

Next, we focused on making local development easier and more secure. We do our development locally, and with our Heroku-based applications, we deploy using environment variables that we add in the Heroku config or in the UI. That means that anyone with the Heroku login and permission can see them. For Sports Crate, we wanted to make the environment variables more secure; we put them in a Kubernetes secret that the applications can easily consume, which also keeps the secrets out of the codebase and off developer laptops.

The local development environment consumes those environmental variables using a railtie that goes out to Kubernetes, retrieves the secrets for the development environment, parses them and puts them into the Rails environment. This allows our developers to "cd" into a repo and run "rails server" or "rails console" with the Kubernetes secrets pulled down before the app starts.
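The underlying mechanics are plain Kubernetes; here's a hedged example with hypothetical secret and key names:

# Store environment variables as a Kubernetes secret
$ kubectl create secret generic app-env \
    --from-literal=DATABASE_URL=postgres://db.example.com/sportscrate \
    --from-literal=SECRET_KEY_BASE=changeme

# The railtie effectively does the equivalent of this before the app boots
$ kubectl get secret app-env -o yaml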

TLS termination and load balancing

Another requirement was to set up effective TLS termination and load balancing. We used a Kubernetes Ingress resource with an Nginx Ingress controller, which provides automatic HTTP-to-HTTPS redirects, functionality that isn’t available from Google Cloud Platform's (GCP) Ingress controller. Once we had the Ingress resource configured with our certificate and our Nginx Ingress controller running behind a service with a static IP, we were able to reach our application from the outside world. Things were starting to come together!
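A hedged sketch of such an Ingress resource (the host, secret and service names are hypothetical, and annotation keys vary across Nginx Ingress controller versions):

$ kubectl apply -f - <<EOF
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: web
  annotations:
    kubernetes.io/ingress.class: "nginx"
    ingress.kubernetes.io/ssl-redirect: "true"
spec:
  tls:
  - secretName: web-tls
  rules:
  - host: sportscrate.example.com
    http:
      paths:
      - backend:
          serviceName: web
          servicePort: 80
EOF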

Auto-scaling and monitoring

With all of the basic pieces of our infrastructure on GCP in place, we looked towards auto-scaling, monitoring and educating our QA team on deployment practices and logging. For pod auto-scaling, we implemented a Kubernetes Horizontal Pod Autoscaler on our deployment. This checks CPU utilization and scales the pods up if we start getting a lot of traffic to our app. For monitoring, we implemented Datadog’s Kubernetes Agent and set up metrics to check for any critical issues and send alerts to PagerDuty. We use Stackdriver for logging and educated our team on how to use the Stackdriver Logging console to properly drill down to the app, namespace and pod they want information about.
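The autoscaler itself is nearly a one-liner; a hedged example with illustrative names and thresholds:

# Scale the web deployment between 3 and 20 pods based on CPU utilization
$ kubectl autoscale deployment web --cpu-percent=70 --min=3 --max=20
$ kubectl get hpa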

Net-net

With launch day around the corner, we ran load tests on our new app and were amazed at how well it handled large amounts of traffic. The pods auto-scaled exactly as we needed them to and our QA team fell in love with continuous deployment with Jenkins Multibranch Pipelines. All told, Container Engine met all of our requirements, and we were up and running within a month.
Our next project is to move our other monolithic Rails apps off of Heroku and onto Container Engine as decoupled microservices that can take advantage of the newest Kubernetes features. We look forward to improving on what has already been an extremely powerful tool.

Going Hybrid with Kubernetes on Google Cloud Platform and Nutanix



Recently, we announced a strategic partnership with Nutanix to help remove friction from hybrid cloud deployments for enterprises. You can find the announcement blog post here.

Hybrid cloud allows organizations to run a variety of applications either on-premises or in the public cloud. With this approach, enterprises can:
  • Increase the speed at which they're releasing products and features
  • Scale applications to meet customer demand
  • Move applications to the public cloud at their own pace
  • Reduce time spent on infrastructure and increase time spent on writing code
  • Reduce cost by improving resource utilization and compute efficiency
The vast majority of organizations have a portfolio of applications with varying needs. In some cases, data sovereignty and compliance requirements force a jurisdictional deployment model where an application and its data must reside in an on-premises environment or within a country’s boundaries. Alternatively, mobile and IoT applications are characterized with unpredictable consumption models that make the on-demand, pay-as-you-go cloud model the best deployment target for these applications.

Hybrid cloud deployments can help deliver the security, compliance and compute power you require with the agility, flexibility and scale you need. Our hybrid cloud example will encompass three key components:
  1. On-premises: Nutanix infrastructure
  2. Public cloud: Google Cloud Platform (GCP)
  3. Open source: Kubernetes and Containers
Containers provide an immutable and highly portable infrastructure that enables developers to predictably deploy apps across any environment where the container runtime engine can run. This makes it possible to run the same containerized application on bare metal, private cloud or public cloud. However, as developers move towards microservice architectures, they must solve a new set of challenges such as scaling, rolling updates, discovery, logging, monitoring and networking connectivity.

Google’s experience running our own container-based internal systems inspired us to create Kubernetes, an open-source platform for running containerized applications across a pool of compute resources, and Google Container Engine, its Google Cloud-managed counterpart. Kubernetes abstracts away the underlying infrastructure and provides a consistent experience for running containerized applications. Kubernetes introduces a declarative deployment model: an operator supplies a template that describes how the application should run, and Kubernetes ensures the application’s actual state always matches the desired state. Kubernetes also manages container scheduling, scaling, health, lifecycle, load balancing, data persistence, logging and monitoring.

In a first phase, the Google Cloud-Nutanix partnership focuses on easing hybrid operations using Nutanix Calm as a single control plane for workload management across both on-premises Nutanix and GCP environments, using Kubernetes as the container management layer across the two. Nutanix Calm was recently announced at the Nutanix .NEXT conference and, once publicly available, will be used to automate provisioning and lifecycle operations across hybrid cloud deployments. Nutanix Enterprise Cloud OS supports a hybrid Kubernetes environment running on Google Compute Engine in the cloud and a Kubernetes cluster on Nutanix on-premises. Through this, customers can deploy portable application blueprints that run both in an on-premises Nutanix environment and in GCP.

Let’s walk through the steps involved in setting up a hybrid environment using Nutanix and GCP.

The steps involved are as follows:
  1. Provision an on-premises 4-node Kubernetes cluster using a Nutanix Calm blueprint
  2. Provision a 4-node Kubernetes cluster on Google Compute Engine using the same Nutanix Calm Kubernetes blueprint, configured for Google Cloud
  3. Use kubectl to manage both the on-premises and Google Cloud Kubernetes clusters
  4. Use Helm to deploy the same WordPress chart on both the on-premises and Google Cloud Kubernetes clusters

Provisioning an on-premises Kubernetes cluster using a Nutanix Calm blueprint

You can use Nutanix Calm to provision a Kubernetes cluster on-premises, and Nutanix Prism, an infrastructure management solution for virtualized data centers, to bootstrap a cluster of virtualized compute and storage. This results in a Nutanix-managed pool of compute and storage that's ready to be orchestrated by Nutanix Calm, for one-click deployment of popular commercial and open-source packages.
The tools used to deploy the Nutanix and Google hybrid cloud stacks.
You can then select the Kubernetes blueprint to target the Nutanix on-premise environment.

The Calm Kubernetes blueprint pictured below configures a four-node Kubernetes cluster that includes all the base software on all the nodes and the master. We’ve also customized our Kubernetes blueprint to configure Helm Tiller on the cluster, so you can use Helm to deploy a WordPress chart. Calm blueprints also allow you to create workflows so that configuration tasks take place in a specified order, as shown below with the “create” action.
Now, launch the Kubernetes Blueprint:
After a couple of minutes, the Kubernetes cluster is up and running with five VMs (one master node and four worker nodes):

Provisioning a Kubernetes cluster on Google Compute Engine with the same Nutanix Calm Kubernetes blueprint

Using Nutanix Calm, you can now deploy the Kubernetes blueprint onto GCP. The Kubernetes cluster is up and running on Compute Engine within a couple of minutes, again with five VMs (one master node + four worker nodes):


You’re now ready to deploy workloads across the hybrid environment. In this example, you'll deploy a containerized WordPress stack.

Using kubectl to manage both on-premises and Google Cloud Kubernetes clusters

kubectl is a command-line tool that comes with Kubernetes for running commands against Kubernetes clusters.

You can now target each Kubernetes cluster across the hybrid environment and use kubectl to run basic commands. First, ssh into the on-premises environment and run a few commands.
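Alternatively, if both clusters' credentials are merged into a single kubeconfig, you can flip between environments from one workstation; the context names in this hedged sketch are hypothetical:

$ kubectl config get-contexts
$ kubectl config use-context nutanix-onprem
$ kubectl get nodes
$ kubectl config use-context gce-kubernetes
$ kubectl get nodes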

# List out the nodes in the cluster

$ kubectl get nodes

NAME          STATUS    AGE
10.21.80.54   Ready     16m
10.21.80.59   Ready     16m
10.21.80.65   Ready     16m
10.21.80.67   Ready     16m

# View the cluster config

$ kubectl config view

apiVersion: v1
clusters:
- cluster:
    server: http://10.21.80.66:8080
  name: default-cluster
contexts:
- context:
    cluster: default-cluster
    user: default-admin
  name: default-context
current-context: default-context
kind: Config
preferences: {}
users: []

# Describe the storageclass configured. This is the Nutanix storage volume plugin for Kubernetes

$ kubectl get storageclass

NAME      KIND
silver    StorageClass.v1.storage.k8s.io

$ kubectl describe storageclass silver

Name:  silver
IsDefaultClass: No
Annotations: storageclass.kubernetes.io/is-default-class=true
Provisioner: kubernetes.io/nutanix-volume

Using Helm to deploy the same WordPress chart on both on-premises and Google Cloud Kubernetes clusters

This example uses Helm, a package manager used to install and manage Kubernetes applications. In this example, the Calm Kubernetes blueprint includes Helm as part of the cluster setup. The on-premise Kubernetes cluster is configured with Nutanix Acropolis, a storage provisioning system, which automatically creates Kubernetes persistent volumes for the WordPress pods.

Let’s deploy WordPress on-premise and on Google Cloud:

# Deploy wordpress

$ helm install wordpress-0.6.4.tgz

NAME:   quaffing-crab
LAST DEPLOYED: Sun Jul  2 03:32:21 2017
NAMESPACE: default
STATUS: DEPLOYED

RESOURCES:
==> v1/Secret
NAME                     TYPE    DATA  AGE
quaffing-crab-mariadb    Opaque  2     1s
quaffing-crab-wordpress  Opaque  3     1s

==> v1/ConfigMap
NAME                   DATA  AGE
quaffing-crab-mariadb  1     1s

==> v1/PersistentVolumeClaim
NAME                     STATUS   VOLUME  CAPACITY  ACCESSMODES  STORAGECLASS  AGE
quaffing-crab-wordpress  Pending  silver  1s
quaffing-crab-mariadb    Pending  silver  1s

==> v1/Service
NAME                     CLUSTER-IP     EXTERNAL-IP  PORT(S)                     AGE
quaffing-crab-mariadb    10.21.150.254  <none>       3306/TCP                    1s
quaffing-crab-wordpress  10.21.150.73   <pending>    80:32376/TCP,443:30998/TCP  1s

==> v1beta1/Deployment
NAME                     DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
quaffing-crab-wordpress  1        1        1           0          1s
quaffing-crab-mariadb    1        1        1           0          1s


Then, you can run a few kubectl commands to browse the on-premise deployment.

# Take a look at the persistent volume claims 

$ kubectl get pvc

NAME                      STATUS    VOLUME                                                                               CAPACITY   ACCESSMODES   AGE
quaffing-crab-mariadb     Bound     94d90daca29eaafa7439b33cc26187536e2fcdfc20d78deddda6606db506a646-nutanix-k8-volume   8Gi        RWO           1m
quaffing-crab-wordpress   Bound     764e5462d809a82165863af8423a3e0a52b546dd97211dfdec5e24b1e448b63c-nutanix-k8-volume   10Gi       RWO           1m

# Take a look at the running pods

$ kubectl get po

NAME                                      READY     STATUS    RESTARTS   AGE
quaffing-crab-mariadb-3339155510-428wb    1/1       Running   0          3m
quaffing-crab-wordpress-713434103-5j613   1/1       Running   0          3m

# Take a look at the services exposed

$ kubectl get svc

NAME                      CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
kubernetes                10.254.0.1      <none>        443/TCP                      16d
quaffing-crab-mariadb     10.21.150.254   <none>        3306/TCP                     4m
quaffing-crab-wordpress   10.21.150.73    #.#.#.#       80:32376/TCP,443:30998/TCP   4m


This on-premises environment did not have a load balancer provisioned, so we used the cluster IP to browse the WordPress site. The Google Cloud WordPress deployment automatically assigned a load balancer to the WordPress service, along with an external IP address.

Summary

  • Nutanix Calm provided a one-click, consistent deployment model to provision a Kubernetes cluster on both Nutanix Enterprise Cloud and Google Cloud.
  • Once the Kubernetes cluster is running in a hybrid environment, you can use the same tools (Helm, kubectl) to deploy containerized applications targeting the respective environment. This represents a “write once, deploy anywhere” model.
  • Kubernetes abstracts away the underlying infrastructure constructs, making it possible to consistently deploy and run containerized applications across heterogeneous cloud environments.

Making the most of an SRE service takeover – CRE life lessons



In Part 2 of this series, we explained what an SRE team would want to learn about a service angling for SRE support, and what kind of improvements they want to see in the service before considering it for takeover. And in Part 1, we looked at why an SRE team would or wouldn’t choose to onboard a new application. Now, let’s look at what happens once the SREs agree to take on the pager.

Onboarding preparation

If a service entrance review determines that the service is suitable for SRE support, developers and the SRE team move into the “onboarding” phase, where they prepare for SREs to support the service.

While developers address the action items, the SRE team starts to familiarize itself with the service, building up service knowledge and familiarity with the existing monitoring tools, alerts and crisis procedures. This can be accomplished through several methods:
  • Education: present the new service to the rest of the team through tech talks, discussion sessions and "wheel of misfortune" scenarios.
  • “Take the pager for a spin”: share pager alerts with the developers for a week, and assess each page on the axes of criticality (does this indicate a user-impacting problem with the service?) and actionability (is there a clear path for the on-call to resolve the underlying issue?). This gives the SRE team a quantitative measure of how much operational load the service is likely to impose.
  • On-call shadow: page the primary on-call developer and SRE at the same time. At this stage, responsibility for dealing with emergencies rests on the developer, but the developer and the SRE collaborate on debugging and resolving production issues together.

Measuring success


Q: I’ve gone through a lot of effort to make my service ready to hand over to SRE. How can I tell whether it was a good expenditure of scarce engineering time?

If the developer and SRE teams have agreed to hand over a system, they should also agree on criteria (including a timeframe) to measure whether the handover was successful. Such criteria may include (with appropriate numbers):
  • An absolute decrease in the count of pages/outages
  • Decreasing pages/outages as a proportion of (increasing) service scale and complexity
  • Reduced time/toil from the point of new code passing tests to being deployed globally, and a flat (or decreasing) rollback rate
  • Increased utilization of reserved resources (CPU, memory, disk, etc.)
Setting these criteria can then prepare the ground for future handover proposals; if the success criteria for a previous handover were not met, the teams should carefully reconsider how this will change the handover plans for a new service.

Taking over the pager


Once all the blocking action items have been resolved, it’s time for SREs to take over the service pager. This should be a "no drama" event, with few, well-documented service alerts that can be easily resolved by following procedures in the service playbook.

In theory, the SRE team will have identified most of these issues in the entrance review phase, but realistically there are many issues that only become apparent with sustained exposure to a service.

In the medium term (one to two months), SREs should build a list of deficiencies or areas for optimization in the system with regard to monitoring, resource consumption etc. This hitlist should primarily aim to reduce SRE “toil” (manual, repetitive, tactical work that has no enduring value), and secondarily fix aspects of the system, e.g., resource consumption or cruft accumulation, which can impact system performance. Tertiary changes may include things like updating the documentation to facilitate onboarding new SREs for system support.

In the long term (three to six months), SREs should expect to meet most or all of the pre-established measurements for takeover success as described above.

Q: That’s great, so now my developers can turn off their pager?

Not so fast, my friend. Although the SRE team has learned a lot about the service in the preceding months, they're still not experts; there will inevitably be failure modes involving arcane service behavior where the SRE on-call will not know what has broken, or how to fix it. There's no substitute for having a developer available, and we normally require developers to keep their on-call rotation so that the SRE on-call can page them if needed. We expect such pages to be rare.

The nuclear option — handing back the pager


Not all SRE takeovers go smoothly, and even if the SREs have taken over the pager for a service, it’s possible for reliability to regress or operational load to increase. This might be for good reasons, such as a “success disaster” (a sustained and unexpected spike in usage), or for bad reasons, such as poor QA testing.

An SRE team can only handle so many services, and if one service starts to consume a disproportionate amount of SRE time, it's at risk of crowding out other services. In this case, the SRE team should proactively tell the developer team that they have a problem, and should do so in a neutral way that’s data-heavy:

In the past month we’ve seen 100 pages/week for service S1, compared to a steady rate of 20-30 pages/week over the past few weeks. Even though S1 is within SLO, the pages are dominating our operational work and crowding out service improvement work. You need to do one of the following:
  1. bring S1’s paging rate down to the original rate by reducing S1’s rate of change
  2. de-tune S1’s alerts so that most of them no longer page
  3. tell us to drop SRE support for services S2, S3 so our overall paging rate remains steady
  4. tell us to drop SRE support for S1
This lets the developer team decide what’s most important to them, rather than the SRE team imposing a solution.

There are also times when developers and SREs agree that handing back the pager to developers is the right thing to do, even if the operational load is normal. For example, imagine SREs are supporting a service, and developers come up with a new, shiny, higher-performing version. Developers support the new version initially, while working out its kinks, and migrate more and more users to it. Eventually the new version is the most heavily used; this is when SREs should take on the pager for the new service and hand the old service’s pager back to developers. Developers can then finish user migrations and turn down the old service at their convenience.

Converging your SRE and dev teams


Onboarding a service is about more than transferring responsibility from developers to SREs; it also improves mutual understanding between the two teams. The dev team gets to know what the SRE team does, and why, who the individual SREs are, and perhaps how they got that way. Similarly, the SRE team gains a better understanding of the development team’s work and concerns. This increase in empathy is a Good Thing in itself, but it's also an opportunity to improve future applications.

Now, when a developer team designs a new application or service, they should take the opportunity to invite the SRE team to the discussion. SRE teams can easily spot reliability issues in the design, and advise developers on ways to make the service easier to operate, set up good monitoring and configure sensible rollout policies from the start.

Similarly, when the SREs do future planning or design new tooling, they should include developers in the discussions; developers can advise them on future launches and projects, and give feedback on making the tools easier to operate or a better fit for developers’ needs.

Imagine that there was a brick wall between the SRE and developer teams; our original plan for service takeover was to throw the service over the wall and hope. Over the course of these blog posts, we’ve shown you how to make a hole in the wall so there can be two-way communication as the service is passed through, then expand it into a doorway so that SREs can come into the developers’ backyard and vice versa. Eventually, developers and SREs should tear down the wall entirely, and replace it with a low hedge and ornamental garden arch. SREs and developers should be able to see what’s going on in each others’ yard, and wander over to the other side as needed.


Summary


When an SRE team takes on pager responsibility for a developer-supported service, don’t just throw the service over the fence into their yard. Work with the SRE team to help them understand how the service works and how it breaks, and to find ways to make it more resilient and easier to support. Make sure that supporting your service is a good use of the SRE team’s time, making use of their particular skills. With a carefully planned handover process, you can both be confident that the queries will flow and your pagers will be (mostly) silent.

Reimagining virtual private clouds



At Cloud Next '17 this year, we announced our reimagining of Virtual Private Cloud (VPC), a product that used to be known as GCP Virtual Networks. Today, we thought we’d share a little more insight into what’s different about VPC and what it can do.

Virtual Private Cloud offers you a privately administered space within Google Cloud Platform (GCP), providing the flexibility to scale and control how workloads connect regionally and globally. This means global connectivity across locations and regions, and the elimination of silos across projects and teams. When you connect your on-premises or remote resources to GCP, you have global access to your VPCs without needing to replicate connectivity or administrative policies per region.

Here’s a little more on what that means.
  • VPC is global. Unlike traditional VPCs that communicate across the public internet, requiring redundant, complex VPNs and interconnections to maintain security, a single Google Cloud VPC can span multiple regions. Single connection points to on-premises resources via VPN or Cloud Interconnect provide private access, reducing costs and configuration complexity.
VMs in VPC do not need VPNs to communicate between regions. Inter-region traffic is both encrypted and kept on Google's private network.
  • VPC is sharable. With a single VPC for an entire organization, you can build multi-tenant architectures and share single private network connectivity between teams and projects with a centralized security model. Your teams can use the network as plug-and-play, instead of stitching connectivity with VPNs. Shared VPC also allows teams to be isolated within projects, with separate billing and quotas, yet still maintain a shared IP space and access to commonly used services such as Interconnect or BigQuery.
A single network can be shared across teams and regions, all within the same administrative domain, preventing duplicate work.
  • VPC is expandable. Google Cloud VPCs let you increase the IP space of any subnet without workload shutdown or downtime. This gives you flexibility and growth options to meet your needs. If you initially build on an IP space of /24s, for example, but need to grow it in one or multiple regions, you can do so quickly and easily without impacting your users (see the sketch after this list).
In Google VPC, the expanded IP range is available in the new zone without rebooting the running VMs. In other VPCs this incurs downtime.
  • VPC is private. With Google VPC you get private access to Google services, such as storage, big data, analytics or machine learning, without having to give your service a public IP address. Configure your application’s front-end to receive internet requests and shield your back-end services from public endpoints, all while being able to access Google Cloud services.
Within Google Cloud, services are directly addressable across regions using private networks and IP addresses without crossing the best-effort public internet.
Global VPCs are divided into regional subnets that use Google’s private backbone to communicate as needed. This allows you to easily distribute different parts of your application across multiple regions to enhance uptime, reduce end-user latency or address data sovereignty needs.
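As a rough sketch of these operations (the network and subnet names and ranges are placeholders, and exact flags vary by gcloud release):

# Create a custom-mode VPC and a subnet in London
$ gcloud compute networks create my-vpc --mode custom
$ gcloud compute networks subnets create london-subnet \
    --network my-vpc --region europe-west2 --range 10.10.0.0/24

# Later, grow the subnet in place with no downtime
$ gcloud compute networks subnets expand-ip-range london-subnet \
    --region europe-west2 --prefix-length 20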

With these enhancements, GCP offers alternatives for increasingly complex networks and workloads, making it easier for organizations to create and manage cloud environments that map closely to business requirements. You can learn more about Google Virtual Private Clouds at https://cloud.google.com/vpc/.

Choosing the right compute option in GCP: a decision tree



When you start a new project on Google Cloud Platform (GCP), one of the earliest decisions you make is which computing service to use: Google Compute Engine, Google Container Engine, App Engine, or even Google Cloud Functions and Firebase.

GCP offers a range of compute services, from full user control (e.g., Compute Engine) to highly abstracted (e.g., Firebase and Cloud Functions), with Google taking care of more and more of the management and operations along the way.

Here’s how many long-time readers of our blog think about GCP compute options. If you're used to managing VMs and want a similar experience in the cloud, pick Compute Engine. If you use containers and Kubernetes, you can abstract away some of the necessary management overhead by using Container Engine. If you want to focus on your code and avoid the infrastructure pieces entirely, use App Engine. Finally, if you want to focus purely on code and build microservices that expose API endpoints for your applications, use Firebase and Cloud Functions.

Over the years, you've told us that this model works great if you have no constraints, but can be challenging if you do. We’ve heard your feedback and propose another way to choose your compute option: a constraint-based set of questions. (It should go without saying that these questions cover only a small slice of what matters to your project.)

1. Are you building a mobile or HTML application that does its heavy lifting, processing-wise, on the client? If you're building a thick client that relies on a backend only for synchronization and/or storage, Firebase is a great option. Firebase lets you store complex NoSQL documents (or objects, if that’s how you think of them) and files using an easy-to-use API, with clients available for iOS, Android and JavaScript. There’s also a REST API for access from other platforms.
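
For platforms without a native client, the Realtime Database's REST interface is plain HTTPS. Here's a minimal sketch in Python; the database URL is a placeholder, and production use would also pass an auth credential:

```python
# Sketch: store and fetch a NoSQL document via the Firebase Realtime
# Database REST API. The database URL is a placeholder.
import requests

BASE = "https://my-app.firebaseio.com"  # placeholder database URL
note = {"title": "hello", "body": "stored as a NoSQL document"}

# PUT writes the object at the given path; the .json suffix selects the REST API.
requests.put(f"{BASE}/notes/note-1.json", json=note).raise_for_status()

# GET reads it back as JSON.
print(requests.get(f"{BASE}/notes/note-1.json").json())
```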

2. Are you building a system based more on events than on user interaction? In other words, are you building an app that responds to uploaded files, or perhaps to logins in other applications? Are you already looking at “serverless” or “Functions as a Service” solutions? Look no further than Cloud Functions. Cloud Functions lets you write JavaScript functions that run on Node.js and can call any of our APIs, including Cloud Vision, Translate, Cloud Storage and over 100 others. With Cloud Functions, you can build individual functions exposed as microservices, taking advantage of all our services without having to maintain systems to glue them together.
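
To illustrate the event-driven shape, here's a sketch of a function that fires when a file lands in a Cloud Storage bucket. (This post describes the Node.js runtime; the sketch below uses the Python background-function form from later runtimes, but the handler shape is analogous, and the function name is hypothetical.)

```python
# Sketch: an event-driven function triggered by a Cloud Storage upload,
# deployed with a bucket trigger rather than an HTTP endpoint.
def on_file_uploaded(event, context):
    """Background function invoked once per uploaded object."""
    bucket = event["bucket"]  # bucket that received the upload
    name = event["name"]      # path of the object within the bucket
    print(f"Processing gs://{bucket}/{name} (event ID {context.event_id})")
    # ...call Cloud Vision, Translate, or any other API on the object here...
```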

3. Does your solution already exist somewhere else? Does it include licensed software? Does it require anything other than HTTP/S? If you answered “no” to all three, App Engine is worth a look. App Engine is a serverless solution that runs your code on our infrastructure and charges you only for what you use. We scale it up or down for you depending on demand. App Engine also has access to all the Google SDKs, so you can take advantage of the full Google Cloud ecosystem.
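
As a sketch of how little scaffolding App Engine needs, here's a minimal Python web app; App Engine runs the server and scales instances for you, and a small app.yaml (not shown) selects the runtime:

```python
# Sketch: a minimal App Engine application (main.py),
# deployed alongside a short app.yaml that names the runtime.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello from App Engine"

if __name__ == "__main__":
    # Local development only; in production App Engine serves the app itself.
    app.run(host="127.0.0.1", port=8080)
```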

4. Are you looking to build a container-based system on Kubernetes? If you're going to run Kubernetes on GCP, you should really consider Container Engine. (In fact, it's worth considering anywhere you plan to run Kubernetes.) Container Engine reduces building a Kubernetes cluster to a single click, and it automatically scales your cluster's nodes, letting you build Kubernetes solutions that grow and shrink with demand.
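
As a sketch of that single-click simplicity in code form, here's cluster creation with the google-cloud-container Python client; the project, zone and cluster name are placeholders, and the request shape may vary by library version:

```python
# Sketch: create a three-node Kubernetes cluster programmatically.
# Assumes the google-cloud-container library; names are placeholders.
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()
client.create_cluster(
    request={
        "parent": "projects/my-project/locations/us-central1-a",
        "cluster": {"name": "demo-cluster", "initial_node_count": 3},
    }
)
```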

5. Are you building a stateful system? Are you looking to use GPUs in your solution? Are you building a non-Kubernetes container-based solution? Are you migrating an existing on-premises solution to the cloud? Are you using licensed software? Are you using protocols other than HTTP/S? Have you not found another solution to meet your needs? If you answered “yes” to any of these questions, you’ll probably need to run your solution on virtual machines on Compute Engine. Compute Engine is our most flexible computing product and gives you the most freedom to configure and manage your VMs however you like.
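
Here's a minimal sketch of booting a VM with the google-cloud-compute Python client; the project, zone, image and machine type are placeholders:

```python
# Sketch: create a VM instance on Compute Engine.
# Assumes the google-cloud-compute library; names are placeholders.
from google.cloud import compute_v1

project, zone = "my-project", "us-central1-a"

instance = compute_v1.Instance(
    name="demo-vm",
    machine_type=f"zones/{zone}/machineTypes/n1-standard-4",
    disks=[
        compute_v1.AttachedDisk(
            boot=True,
            auto_delete=True,
            initialize_params=compute_v1.AttachedDiskInitializeParams(
                source_image="projects/debian-cloud/global/images/family/debian-11"
            ),
        )
    ],
    network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
)

operation = compute_v1.InstancesClient().insert(
    project=project, zone=zone, instance_resource=instance
)
operation.result()  # block until the VM is created
```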

Put all of these questions together and you get a decision-tree flowchart for choosing a compute option. It's by no means comprehensive, and each of our products supports a wider range of use cases than presented here, but it should be a good guide to get you started.

To find out more about our computing solutions, check out Computing on Google Cloud Platform, then try it for yourself with $300 in free credits when you sign up.

Happy building!


Solution guide: Building connected vehicle apps with Cloud IoT Core



With the Internet of Things (IoT), vehicles are evolving from self-contained commodities focused on transportation to sophisticated, Internet-connected endpoints often capable of two-way communication. The new data streams generated by modern connected vehicles drive innovative business models such as usage-based insurance, enable new in-vehicle experiences and build the foundation for advances such as autonomous driving and vehicle-to-vehicle (V2V) communication.
We here at Google Cloud are excited to help make this world a reality, and we recently published a solution guide describing how various Google Cloud Platform (GCP) services fit into the picture.

A data deluge

Vehicles can produce upwards of 560 GB of data per vehicle, per day. This deluge represents both incredible opportunities and daunting challenges for the platforms that connect and manage vehicle data, including:

  • Device management. Connecting devices to any platform requires authentication, authorization, and the ability to push software updates, configuration and monitoring. These services must scale to millions of devices and be constantly available.
  • Data ingestion. Messages must be reliably received, processed and stored.
  • Data analytics. Complex analysis of the time-series data generated by devices must yield insights into events, tolerances, trends and possible failures.
  • Applications. Business-level application logic must be developed and integrated with existing data sources, which may come from third parties or live in on-premises data centers.
  • Predictive models. In order to predict business-level outcomes, predictive models based on current and historical data must be developed.

GCP services, including the recently launched Cloud IoT Core, provide a robust computing platform that takes advantage of Google’s end-to-end security model. Let’s take a look at how to implement a connected vehicle platform using Google Cloud services.

Device Management: Cloud IoT Core makes it easy to securely connect your globally distributed devices to GCP and manage them centrally. The IoT Core Device Manager provides authentication and authorization, while the IoT Core Protocol Bridge handles messaging between the vehicles and the platform.
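
As a sketch of what registration looks like, here's a device being added to a registry with the google-cloud-iot Python client; the project, region, registry, device ID and key file are placeholders, and enum locations vary by library version:

```python
# Sketch: register a vehicle's identity with the IoT Core Device Manager.
# Assumes the google-cloud-iot library; all names are placeholders.
from google.cloud import iot_v1

client = iot_v1.DeviceManagerClient()
parent = client.registry_path("my-project", "us-central1", "vehicle-registry")

device = iot_v1.Device(
    id="vehicle-0001",  # placeholder device ID
    credentials=[
        iot_v1.DeviceCredential(
            public_key=iot_v1.PublicKeyCredential(
                format=iot_v1.PublicKeyFormat.RSA_X509_PEM,
                key=open("rsa_cert.pem").read(),  # device's public key
            )
        )
    ],
)
client.create_device(parent=parent, device=device)
```

The device then authenticates to the Protocol Bridge by signing a JWT with the matching private key.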

Data Ingestion: Cloud Pub/Sub provides a scalable ingestion point that can handle the large data volumes generated by vehicles sending GPS location, engine RPM or images. Cloud Bigtable’s scalable storage service is well-suited to time-series data storage and analytics.
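
A minimal sketch of the ingestion path, publishing one telemetry reading with the google-cloud-pubsub Python client; the project and topic names are placeholders:

```python
# Sketch: publish a telemetry reading to a Cloud Pub/Sub topic.
# Assumes the google-cloud-pubsub library; names are placeholders.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic = publisher.topic_path("my-project", "vehicle-telemetry")

reading = {"vehicle_id": "vehicle-0001", "engine_rpm": 2450,
           "ts": "2017-09-01T12:00:00Z"}
future = publisher.publish(topic, json.dumps(reading).encode("utf-8"))
print(future.result())  # message ID once the publish is acknowledged
```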

Data Analytics: Cloud Dataflow can process data pipelines that combine the vehicle device data with corporate vehicle and customer data, then store the combined data in BigQuery. BigQuery provides a powerful analytics engine as-a-service and integrates with common visualization tools such as Tableau, Looker and Qlik.
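
Once the combined data lands in BigQuery, analysis is plain SQL. Here's a sketch with the google-cloud-bigquery Python client; the dataset, table and columns are placeholders:

```python
# Sketch: an analytics query over ingested vehicle telemetry.
# Assumes the google-cloud-bigquery library; table and columns are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT vehicle_id, AVG(engine_rpm) AS avg_rpm
    FROM `my-project.telemetry.readings`
    WHERE ts > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    GROUP BY vehicle_id
    ORDER BY avg_rpm DESC
"""
for row in client.query(query).result():
    print(row.vehicle_id, row.avg_rpm)
```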

Applications: Compute Engine, Container Engine and App Engine all provide computing components for a connected vehicle platform. Compute Engine offers a range of machine types, making it an ideal service for third-party integration components. Container Engine runs and manages containers, offering a high degree of flexibility and scalability for microservices architectures. Finally, App Engine is a scalable serverless platform well suited to consumer mobile and web application frontends.

Predictive Models: TensorFlow and Cloud Machine Learning Engine provide a sophisticated modeling framework and a scalable execution environment. TensorFlow provides the framework to develop custom deep neural network models and is optimized for performance, flexibility and scale, all of which are critical when working with IoT-generated data. Machine Learning Engine provides a scalable environment for training TensorFlow models on specialized Google hardware, including GPUs and TPUs.
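
As a sketch of the modeling side, here's a small TensorFlow/Keras network that scores failure risk from a handful of telemetry features; the feature layout and training data are placeholders, and a real model would train on historical fleet data:

```python
# Sketch: a toy failure-risk classifier over telemetry features.
# Feature layout and data are placeholders for illustration.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of failure
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Placeholder training data: 8 telemetry features per example.
x = np.random.rand(1024, 8).astype("float32")
y = (np.random.rand(1024) > 0.9).astype("float32")
model.fit(x, y, epochs=3, batch_size=64)
```

Packaged for Machine Learning Engine, the same network can then be trained at scale on GPUs or TPUs.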

Summary

Vehicles are becoming sophisticated IoT devices with built-in mobile technology platforms to which third parties can connect and offer advanced services. GCP provides a secure, robust and scalable platform to connect IoT devices ranging from sophisticated head units to simple, low-powered sensors. You can learn more about the next generation of connected vehicles with GCP by reading the solution paper: Designing a Connected Vehicle Platform on Cloud IoT Core.