Category Archives: Google Cloud Platform Blog

Product updates, customer stories, and tips and tricks on Google Cloud Platform

Why should your app get SRE support? – CRE life lessons



Editor’s note: Once you run many applications or services in your company, you'll start to bump up against the limit of what your primary SRE (or Ops) team can support. In this installment of CRE Life Lessons, we're going to look at how you can make good, principled and defensible decisions about which of your company’s applications and services you should give to your SREs to support, and how to decide when that subset needs to change.

At Google, we're fortunate to have Site Reliability Engineering (SRE) teams supporting both our horizontal infrastructure such as storage, networking and load balancing, and our major applications such as Search, Maps and Photos. Nevertheless, the combination of software engineering and system engineering skills required of the role makes it hard to find and recruit SREs, and demand for them steadily outstrips supply.

Over time we’ve found some practical limits to the number of applications that an SRE team can support, and learned the characteristics of applications that are more trouble to support than others. If your company runs many production applications, your SRE team is unlikely to be able to support them all.

Q: How will I know when my company’s SRE team is at its limit? How do I choose the best subset of applications to support? When should the SRE team drop support for an application?

Good questions all; let’s explore them in more detail.

Practical limitations on SRE support


At Google, the rule of thumb for the minimum SRE team needed to staff a pager rotation without burn-out is six engineers; for a 24/7 pager rotation with a target response time under 30 minutes, we don’t want any engineer to be on-call for more than 12 continuous hours because we don’t want paging alerts interrupting their sleep. This implies two groups of six engineers each, with a wide geographic spread so that each team can handle pages mostly in their daytime.

At any one time, there's usually a designated primary who responds to pages, and a secondary who catches fall-through pages e.g., if the primary is temporarily out of contact, or is in the middle of managing an incident. The primary and secondary handle normal ops work, freeing the rest of the team for project work such as improving reliability, building better monitoring or increasing automation of ops tasks. Therefore every engineer has two weeks out of six focused on operational work -- one as primary, one as secondary.

Q: Surely 12 to 16 engineers can handle support for all the applications your development team can feasibly write?

Actually, no. Our experience is that there is a definite cognitive limit to how many different applications or services an SRE team can effectively manage; any single engineer needs to be sufficiently familiar with each app to troubleshoot, diagnose and resolve most production problems. If you want to make it easy to support many apps at once, you’ll want to make them as similar as possible: design them to use common patterns and back-end services, standardize on common tools for operational tasks like rollout, monitoring and alerting, and deploy them on similar schedules. This reduces the per-app cognitive load, but doesn’t eliminate it.

If you do have enough SREs, then you might consider making two teams (again, subject to the 2 x 6 minimum staffing limit) and giving them separate responsibilities. At Google, it’s not unusual for a single SRE team to split into front-end and back-end shards as the system grows, each taking responsibility for supporting only its half of the system. (We call this team mitosis.)

Your SRE team’s maximum number of supported services will be strongly influenced by factors such as:

  • the regular operational tasks needed to keep the services running well, for example releases, bug fixes, non-urgent alerts/bugs. These can be reduced (but not eliminated) by automation;
  • “interrupts” -- unscheduled, non-critical human requests. We’ve found these awkwardly resistant to efforts to reduce them; the most effective strategy has been self-service tools that address the 50% to 70% of queries that are repeats;
  • emergency alert response, incident management and follow-up. The best way to spend less time on these is to make the service more reliable, and to have better-tuned alerts (i.e., that are actionable and which, if they fire, strongly indicate real problems with the service).


Q: What about the four weeks out of six during which an SRE isn’t doing operational work -- could we use that time to increase our SRE team’s supported service capacity?

You could do this but at Google we view this as “eating your seed corn.” The goal is to have the machines do all the things that are possible for machines to do, and for that to happen you need to leave breathing room for your SREs to do project work such as producing new automation for your service. In our experience, once a team crosses the 50% ops work threshold, it quickly descends a slippery slope to 100% ops. In that condition you’re losing the engineering effort that will give you medium-to-long term operational benefits such as reducing the frequency, duration and impact of future incidents. When you move your SRE team into nearly full-time ops work, you lose the benefit of its engineering design and development skills.

Note in particular that SRE engineering project work can reduce operational load by addressing many of the factors we described above, which were limiting how many services an SRE team could support.

Given the above, you may well find yourself in a position where you want your SRE team to onboard a new service, but in practice they're not able to support it on a sustainable basis.

You’re out of SRE support capacity - now what?

At Google, our working principle is that any service that’s not explicitly supported by SRE must be supported by its developer team; if you have enough developers to write a new application, then you probably have enough developers to support it. Our developers tend to use the same monitoring, rollout and incident management tools as the SREs they work with, so the operational support workload is similar. In any case, we like the developers who wrote an application to support it directly for a while, so they can get a good feel for how customers are experiencing it. The things they learn doing so help SREs to onboard the service later.

Q: What about the next application we want the developers to write? Won’t they be too busy supporting the current application?

This may be true -- the current application may be generating a high operational workload, due to excessive alerts or a lack of automation. However, this gives the developer team a practical incentive to spend time making the application easier to support — tuning alerts, spending developer time on automation, and reducing the velocity of functional changes.

When developers are overloaded with operational work, SREs might be able to lend operational expertise and development effort to reduce the developers’ workloads to a manageable level. However, SREs still shouldn’t take on operational responsibility for the service, as this won’t solve the fundamental problem.

When one team develops an application and another team bears the brunt of the operational work for it, moral hazard thrives. Developers want high development velocity; it’s not in their interest to spend days running down and eliminating every odd bug that occasionally causes their server to run out of memory and need to be restarted. Meanwhile, the operational team is getting paged to do those restarts several times per day -- it’s very much in their interest to get that bug fixed, since it’s their sleep that is being interrupted. Not surprisingly, when developers bear the operational load for their own system, they too are incented to spend time making it easier to support. This also turns out to be important for persuading an SRE team to support their application, as we shall see later.

Choosing which applications to support


The easiest way to prioritize the applications for SRE to support is by revenue or other measures of business criticality, i.e., how big the impact would be if the service went down. After all, having an SRE team supporting your service should improve its reliability and availability.

Q: Sounds good to me; surely prioritizing by business impact is always the right choice?

Not always. There are services which actually don’t need much support work; a good example is a simple infrastructure service (say, a distributed key-value store) that has reached maturity and is updated only infrequently. Since nothing is really changing in the service, it’s unlikely to break spontaneously. Even if it’s a critical dependency of several user-facing applications, it might not make sense to dedicate SRE support; rather, let its developers hold the pager and handle the low volume of operational work.

At Google, we consider that SRE teams have seven areas of focus that developers typically don’t:

  • Monitoring and metrics. For example, detecting response latency, error or unanswered query rate, and peak utilization of resources
  • Emergency response. Running on-call rotations, traffic-dip detection, primary/secondary/escalation, writing playbooks, running Wheels of Misfortune
  • Capacity planning. Doing quarterly projections, handling a sudden sustained load spike, running utilization-improvement projects
  • Service turn-up and turn-down. For services which run in many locations (e.g., to reduce end-user latency), planning location turn-up/down schedules and automating the process to reduce risks and operational load
  • Change management. Canarying, 1% experiments, rolling upgrades, quick-fail rollbacks, and measuring error budgets
  • Performance. Stress and load testing, resource-usage efficiency monitoring and optimization.
  • Data Integrity. Ensuring that non-reconstructible data is stored resiliently and highly available for reads, including the ability to rapidly restore it from backups


With the possible exception of “emergency response” and “data integrity,” our key-value store wouldn’t benefit substantially from any of these areas of expertise, and the marginal benefit of having SREs rather than developers support it is low. On the other hand, the opportunity cost of spending SRE support capacity on it is high; there are likely to be other applications which could benefit from more of SREs’ expertise.

One other reason that SREs might take on responsibility for an infrastructure service that doesn’t need SRE expertise is that it's a crucial dependency of services they already run. In that case, there could be a significant benefit to them of having visibility into, and control of, changes to that service.

In part 2 of this blog post, we’ll take a look at how our SRE team could determine how -- and indeed, whether -- to onboard a business-critical service once it has been identified as able to benefit from SRE support.

Google Compute Engine ranked #1 in price-performance by Cloud Spectator



Cloud Spectator, an independent benchmarking and consulting agency, has released a new comparative benchmarking study that ranks Google Cloud #1 for price-performance and block storage performance against AWS, Microsoft Azure and IBM SoftLayer.

In January 2017, Cloud Spectator tested the overall price-performance, VM performance and block storage performance of four major cloud service providers: Google Compute Engine, Amazon Web Services, Microsoft Azure, and IBM SoftLayer. The result is a rare apples-to-apples comparison among major Cloud Service Providers (CSPs), whose distinct pricing models can make them difficult to compare.

According to Cloud Spectator, “A lack of transparency in the public cloud IaaS marketplace for performance often leads to misinformation or false assumptions.” Indeed, RightScale estimates that up to 45% of cloud spending is wasted on resources that never end up being used — a serious hit to any company’s IT budget.

The report can be distilled into three key insights, which upend common misconceptions about cloud pricing and performance:
  • Insight #1: VM performance varies across cloud providers. In testing, Cloud Spectator observed differences of up to 1.4X in VM performance and 6.1X in block storage performance.
  • Insight #2: You don’t always get what you pay for. Cloud Spectator’s study found no correlation between price and performance.
  • Insight #3: Resource contention (the “Noisy Neighbor Effect”) can affect performance — but CSPs can limit those effects. Cloud Spectator points out that noisy neighbors are a real problem with some cloud vendors. To try to handle the problem, some vendors throttle down their customers' access to resources (like disks) in an attempt to compensate for other VMs (so-called Noisy Neighbors) on the same host machine.

You can download the full report here, or keep reading for key findings.

Key finding: Google leads for overall price-performance

Value, defined as the ratio of performance to price, varies by 2.4x across the four IaaS providers compared, with Google achieving the highest CloudSpecs Score (see Methodology, below). This is due to strong disk performance and the lowest packaged pricing found in the study.


To learn more, download “2017 Best Hyperscale Cloud Providers: AWS vs. Azure vs. Google vs. SoftLayer,” a report by Cloud Spectator.


Methodology

Cloud Spectator’s price-performance calculation, the CloudSpecs Score™, provides information on how much performance the user receives for each unit of cost. The CloudSpecs Score™ is an indexed, comparable score ranging from 0-100, indicative of value based on a combination of cost and performance. The CloudSpecs Score™ is calculated as:

price-performance_value = [VM performance score] / [VM cost]
best_VM_value = max{price-performance_values}
CloudSpecs Score™ = 100 * price-performance_value / best_VM_value
The overall CloudSpecs Score™ was calculated by averaging the block storage and vCPU-memory price-performance scores together, so that they have equal weight for each VM size; the resulting scores for all VM sizes were then averaged together.
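To make the arithmetic concrete, here’s a small worked example of the scoring formula in Python (a sketch with made-up performance and cost numbers, not figures from the report):

def cloudspecs_scores(providers):
    """Index each provider's price-performance against the best value (score 100)."""
    # price-performance value = performance score per unit of cost
    values = {name: perf / cost for name, (perf, cost) in providers.items()}
    best = max(values.values())
    return {name: 100 * value / best for name, value in values.items()}

# Hypothetical (performance score, hourly cost) pairs -- not from the report.
providers = {
    'provider_a': (1200.0, 0.10),
    'provider_b': (1100.0, 0.12),
    'provider_c': (1000.0, 0.15),
}
print(cloudspecs_scores(providers))
# The best-value provider scores 100; the rest are scaled relative to it.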


Google Cloud Platform expands to Australia with new Sydney region – open now



Starting today, developers can choose to run applications and store data in Australia using the new Google Cloud Platform (GCP) region in Sydney. This is our first GCP region in Australia and the fourth in Asia Pacific, joining Taiwan, Tokyo and the recently launched Singapore.

GCP customers down under will see significant reductions in latency when they run their applications in Sydney. Our performance testing shows 80% to 95% reductions in round-trip time (RTT) latency when serving customers from New Zealand and Australian cities such as Sydney, Auckland, Wellington, Melbourne, Brisbane, Perth and Adelaide, compared to using regions in Singapore or Taiwan.

The Sydney GCP region is launching with three zones and several GCP services; App Engine and Datastore will be available shortly.

Google Cloud customers benefit from our commitment to large-scale infrastructure investments. With the addition of each new region, developers have more choice in running applications close to their customers. Google’s networking backbone, meanwhile, transforms compute and storage infrastructure into a global-scale computer, giving developers around the world access to the same cloud infrastructure that Google engineers use every day.

In Asia-Pacific, we’re already building another region in Mumbai, as well as new network infrastructure to tie them all together, including the SJC and Indigo subsea fiber-optic cable systems.

What customers are saying

Here’s what the new region means to a few of our customers and partners.
"The regional expansion of Google Cloud Platform to Australia will help enable PwC's rapidly growing need to experiment and innovate and will further extend our work with Google Cloud.

It not only provides a reliable and resilient platform that can support our firm's core technology needs, it also makes available to us, GCP's market leading technologies and capabilities to support the unprecedented demand of our diverse and evolving business."


—Hilda Clune, Chief Information Officer, PwC Australia
"Monash University has one of the most ambitious digital transformation agendas in tertiary education. We're executing our strategy at pace and needed a platform which would give us the scale, flexibility and functionality to respond rapidly to our development and processing needs. Google Cloud Platform (GCP) and in particular App Engine have been a great combination for us, and we're very excited at the results we're getting. Having Google Cloud Platform hosted now in Australia is a big bonus." 
—Trevor Woods, Chief Information Officer, Monash University
"Modern geophysical technologies place a huge demand on supercomputing resources. Woodside utilises Google Cloud as an on-demand solution for our large computing requirements. This has allowed us to push technological boundaries and dramatically reduce turnaround time."
— Sean Salter, VP Technology, Woodside Energy Ltd.

Next steps

We want to help you build what’s next for you. If you’re looking for help to understand how to deploy GCP, please contact local partners: Shine Solutions, Servian, 3WKS, Axalon, Onigroup, PwC, Deloitte, Glintech, Fronde or Megaport.

For more details on Australia’s first region, please visit our Sydney region page where you’ll get access to free resources, whitepapers, an on-demand training video series called "Cloud On-Air" and more. These will help you get started on GCP. Give us a shout to request early access to new regions and help us prioritize what we build next.

New Singapore GCP region – open now



The Singapore region is now open as asia-southeast1. This is our first Google Cloud Platform (GCP) region in Southeast Asia (and our third region in Asia), and it promises to significantly improve latency for GCP customers and end users in the area.

Customers are loving GCP in Southeast Asia; the total number of paid GCP customers in Singapore has increased by 100% over the last 12 months.

And the experience for GCP customers in Southeast Asia is better than ever too; performance testing shows 51% to 98% reductions in round-trip time (RTT) latency when serving customers in Singapore, Jakarta, Kuala Lumpur and Bangkok compared to using other GCP regions in Taiwan or Tokyo.

Customers with a global footprint like BBM Messenger, Carousell and Go-Jek have been looking forward to the launch of the Singapore region.
"We are excited to be able to deploy into the GCP Singapore region, as it will allow us to offer our services closer to BBM Messenger key markets. Coupled with Google's global load balancers and extensive global network, we expect to be able to provide a low latency, high-speed experience for our users globally. During our POCs, we found that GCP outperformed most vendors on key metrics such as disk I/O and network performance on like-for-like benchmarks. With sustained usage discounts and continuous support from Google's PSO and account team, we are excited to make GCP the foundation for the next generation of BBM consumer services. Matthew Talbot, CEO of Creative Media Works, the company that runs BBM Messenger Consumer globally.
"As one of the largest and fastest growing mobile classifieds marketplaces in the world, Carousell needed a platform that was agile enough for a startup, but could scale quickly as we expand. We found all these qualities in the Google Cloud Platform (GCP), which gives us a level of control over our systems and environment that we didn't find elsewhere, along with access to cutting edge technologies. We're thrilled that GCP is launching in Singapore, and look forward to being inspired by the way Google does things at scale."  — Jordan Dea-Mattson, Vice President Engineering, Carousell

"We are extremely pleased with the performance of GCP, and we are excited about the opportunities opening in Indonesia and other markets, and making use of the Singapore Cloud Region. The outcomes we’ve achieved in scaling, stability and other areas have proven how fantastic it is to have Google and GCP among our key service partners." — Ajey Gore, CTO, Go-Jek
We’ve launched Singapore with two zones and a range of GCP services.

In addition, you can combine any of the services you deploy in Singapore with other GCP services around the world such as DLP, Spanner and BigQuery.

Singapore Multi-Tier Cloud Security certification

Google Cloud is pleased to announce that, having completed the required assessment, it has been recommended by an approved certification body for Level 3 certification of Singapore's Multi-Tier Cloud Security (MTCS) standard (SS 584:2015+C1:2016). Customers can expect formal approval of Google Cloud's certification in the coming months. As a result of achieving this certification, organizations that require compliance with the strictest levels of the MTCS standard can now confidently adopt Google Cloud services and host their data on Google Cloud's infrastructure.

Next steps

If you’re looking for help to understand how to deploy GCP, please contact local partners Sakura Sky, CloudCover, Cloud Comrade and Powerupcloud.

For more details on the Singapore region, please visit our Singapore region portal, where you’ll get access to free resources, whitepapers, an on-demand video series called "Cloud On-Air" and more. These will help you get started on GCP. Our locations page provides updates on other regions coming online soon. Give us a shout to request early access to new regions and help us prioritize what we build next.

Best practices for App Engine startup time: Google Cloud Performance Atlas



[Editor’s note: In the past couple of months, Colt McAnlis of Android Developers fame joined the Google Cloud developer advocate team. He jumped right in and started blogging — and vlogging — for the new Google Cloud Performance Atlas series, focused on extracting the best performance from your GCP assets. Check out this synopsis of his first video, where he tackles the problem of cold boot performance in App Engine standard environment. Vroom vroom!]

One of the fantastic features of the App Engine standard environment is that it has load balancing built in, and can spin up or spin down instances based upon traffic demands. This is great in situations where your content goes viral, or for the daily ebb and flow of traffic, since you don’t have to spend time thinking about provisioning at all.

As a baseline, it’s easy to establish that App Engine startup time is really fast. The following graph charts instance type vs. startup time for a basic Hello World application:


250 ms is pretty fast to boot up an App Engine F2 instance class. That’s faster than fetching a JavaScript file from most CDNs on a 4G connection, and it shows that App Engine responds quickly to requests to create new instances.
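For context, the “Hello World” used for this kind of baseline is about as small as an App Engine app gets -- something along these lines, assuming the Python 2.7 standard environment with webapp2 (a sketch, not the exact app behind the graph):

import webapp2

class HelloHandler(webapp2.RequestHandler):
    def get(self):
        # A single trivial route with no heavy imports, so startup time
        # reflects the platform rather than the application code.
        self.response.write('Hello, World!')

app = webapp2.WSGIApplication([('/', HelloHandler)])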

There are great resources that detail how App Engine manages instances, but for our purposes, there’s one main concept we’re concerned with: loading requests.

A loading request triggers App Engine’s load balancer to spin up a new instance. This is important to note, since the response time for a loading request will be significantly higher than average: the request must wait for the instance to boot up before it's serviced.

As such, the key to being able to respond to rapid load balancing while keeping user experience high is to optimize the cold-boot performance of your App Engine application. Below, we’ve gathered a few suggestions on addressing the most common problems to cold-boot performance.

Leverage resident instances

Resident instances are instances that stick around regardless of the type of load your app is handling; even when you’ve scaled to zero, these instances will still be alive.

When spikes do occur, resident instances service requests that cannot be serviced in the time it would take to spin up a new instance; requests are routed to them while a new instance spins up. Once the new instance is up, traffic is routed to it and the resident instance goes back to being idle.


The point here is that resident instances are the key to scaling rapidly without sending users’ perceived latency through the roof. In effect, resident instances hide instance startup time from the user, which is a good thing!

For more information, check out our Cloud Performance Atlas article on how resident instances helped a developer reduce their startup time.

Be careful with initializing global variables during parallel requests

While using global variables is a common programming practice, they can create a performance pitfall in certain scenarios relating to cold boot performance. If your global variable is initialized during the loading request AND you’ve got parallel requests enabled, your application can fall into a bit of a trap, where multiple parallel requests end up blocking, waiting on the first loading request to finish initializing your global variable. You can see this effect in the logging snapshot below:
The very first request is our loading request, and the next batch is a set of blocked parallel requests, waiting for a global variable to initialize. You can see that these blocked requests can easily end up with 2x higher response latency, which is less than ideal.
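As a rough illustration (a sketch, not code from the article), one common mitigation in Python on the App Engine standard environment is to keep module-level globals cheap and push the expensive initialization into a warm-up handler, so that a warmup/loading request -- rather than live user traffic -- pays the cost. The _load_model function below is a hypothetical stand-in for whatever your app actually initializes:

import threading
import time
import webapp2

_heavy_resource = None           # filled in lazily, not at import time
_init_lock = threading.Lock()    # only one request does the expensive work

def _load_model():
    # Hypothetical stand-in for expensive setup: loading data, warming
    # caches, opening connections, etc.
    time.sleep(5)
    return 'model-v1'

def _get_heavy_resource():
    """Initialize the expensive global exactly once, on first use."""
    global _heavy_resource
    if _heavy_resource is None:
        with _init_lock:
            if _heavy_resource is None:   # re-check after acquiring the lock
                _heavy_resource = _load_model()
    return _heavy_resource

class WarmupHandler(webapp2.RequestHandler):
    def get(self):
        # /_ah/warmup runs before an instance takes live traffic (when the
        # warmup inbound service is enabled in app.yaml), so users don't
        # block on initialization.
        _get_heavy_resource()
        self.response.write('warmed up')

class MainHandler(webapp2.RequestHandler):
    def get(self):
        self.response.write('using %s' % _get_heavy_resource())

app = webapp2.WSGIApplication([
    ('/_ah/warmup', WarmupHandler),
    ('/', MainHandler),
])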

For more info, check out our Cloud Performance Atlas article on how global variables caused one developer a lot of headaches.

Be careful with dependencies

During cold-boot time, your application code is busy scanning and importing dependencies. The longer this takes, the longer it will take for your first line of code to execute. Some languages can optimize this process to be exceptionally fast; other languages are slower but provide more flexibility.

And to be fair, most of the time, a standard application importing a few modules should have a negligible impact on performance. However, when third-party libraries get big enough, we start to see them do weird things with import semantics, which can mess up your boot time significantly.
Addressing dependency issues is no small feat. You might have to use warm-up requests, lazy-load your imports, or in the most extreme case, prune your dependency tree.
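To illustrate lazy-loading (a sketch, not code from the article), you can defer a heavyweight import from module load time to the first request that actually needs it; matplotlib here is just a stand-in for any large third-party library:

import webapp2  # keep module-level imports lean so cold boot stays fast

class ChartHandler(webapp2.RequestHandler):
    def get(self):
        # Deferred import: only requests that actually render a chart pay
        # for the heavy dependency, keeping it off the cold-boot path.
        import matplotlib
        self.response.write('loaded matplotlib %s on demand' % matplotlib.__version__)

app = webapp2.WSGIApplication([('/chart', ChartHandler)])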

For more info, check out our Cloud Performance Atlas article on how the developer of a platypus-based calculator tracked down a dependency problem.


Every millisecond counts

In the end, optimizing cold-boot performance for App Engine instances is critical for scaling quickly and keeping user perception of latency in a good place. If you’d like to know more about ways to optimize your Google Cloud applications, check out the rest of the Google Cloud Performance Atlas blog posts and videos. Because when it comes to performance, every millisecond counts.

Add log statements to your application on the fly with Stackdriver Debugger Logpoints



In 2014 we launched Snapshots for Stackdriver Debugger, which gave developers the ability to examine their application’s call stack and variables in production with no impact to users. In the past year, developers have taken over three hundred thousand production snapshots across their services running on Google App Engine and on VMs and containers hosted anywhere.

Today we’re showing off Stackdriver Debugger Logpoints. With Logpoints, you can instantly add log statements to your production application without rebuilding or redeploying it. Like Snapshots, this is immensely useful when diagnosing tricky production issues that lack an obvious root cause. Even better, Logpoints fits into existing logs-based workflows.
Adding a logpoint is as simple as clicking a line in the Debugger source viewer and typing in your new log message (just make sure that you open the Logpoints tab in the right hand pane first). If you haven’t synced your source code, you can add Logpoints by specifying the target file and line number in the right-hand pane or via the gcloud command line tools. Variables can be referenced by {variableName}. You can review the full documentation here.

Because Logpoints writes its output through your app’s existing logging mechanism, it's compatible with any logging aggregation and analysis system, including Splunk or Kibana, or you can read its output from locally stored logs. However, Stackdriver Logging customers benefit from being able to read their log output from within the Stackdriver Debugger UI.


Logpoints is already available for applications written in Java, Go, Node.js, Python and Ruby via the Stackdriver Debugger agents. As with Snapshots, this same set of languages is supported across VMs (including Google Compute Engine), containers (including Google Container Engine), and Google App Engine. Logpoints has been accessible through the gcloud command line interface for some time, and the process for using Logpoints in the CLI hasn’t changed.

Each logpoint lasts up to 24 hours, or until it's deleted or the application is redeployed. Adding a logpoint incurs a performance cost on par with adding an additional log statement to your code directly. However, the Stackdriver Debugger agents automatically throttle any logpoints that negatively impact your application’s performance, as well as any logpoints or snapshots with conditions that take too long to evaluate.

At Google, we use technology like Snapshots and Logpoints to solve production problems every day to make our services more performant and reliable. We’ve heard from our customers how snapshots are the bread and butter of their problem-solving processes, and we’re excited to see how you use Logpoints to make your cloud applications better.

Partnering on open source: Google and Ansible engineers on managing GCP infrastructure



It's time for the third chapter in the Partnering on open source series. This time around, we cover some of the work we’ve done with Ansible, a popular open source IT automation engine, and how to use it to provision, manage and orchestrate Google Cloud Platform (GCP) resources.

Ansible, by Red Hat, is a simple automation language that can perfectly describe an IT application infrastructure on GCP, including virtual machines, disks, network load balancers, firewall rules and more. In this series, I'll walk you through my former life as a DevOps engineer at a satellite space imaging company. You'll get a glimpse into how I used Ansible to update satellites in orbit, along with other critical infrastructure that serves imagery to interested viewers around the globe.

In this first video, we set the stage and talk about Ansible in general, before diving into hands-on walkthroughs in subsequent episodes.



Upcoming videos demonstrate how to use Ansible and GCP to:

  • Apply a camera-settings hotfix to a satellite orbiting Earth by spinning up a Google Compute Engine instance, testing the latest satellite image build and pushing the settings to the satellite.
  • Provision and manage GCP's advanced networking features like globally available load-balancers with L7 routing to serve satellite ground images on a public website.
  • Create a set of networks, routes and firewall rules with security rules to help isolate and protect the various systems involved in the imagery processing pipeline. The raw images may contain sensitive data that must be appropriately screened and scrubbed before being added to the public image repository, so network security is critical.

The series wraps up with a demonstration of how to extend Ansible's capabilities by writing custom modules. The videos in this series make use of custom and publicly available modules for GCP.
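For a sense of what writing a custom module involves (a bare-bones sketch, not one of the modules used in the videos), an Ansible module is essentially a Python file that declares its parameters through AnsibleModule and reports its result as JSON:

#!/usr/bin/python
# Minimal custom Ansible module skeleton (illustrative only).
from ansible.module_utils.basic import AnsibleModule

def main():
    module = AnsibleModule(
        argument_spec=dict(
            name=dict(type='str', required=True),
            state=dict(type='str', default='present', choices=['present', 'absent']),
        ),
        supports_check_mode=True,
    )

    # A real module would call a cloud API here (for example, to create or
    # delete a GCP resource) and set `changed` based on what actually happened.
    changed = not module.check_mode

    module.exit_json(changed=changed, name=module.params['name'])

if __name__ == '__main__':
    main()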

Join us on YouTube to watch the upcoming videos, or go back and watch the other videos in the series. You can also follow Google Cloud on YouTube, or @GoogleCloud on Twitter, to find out when new videos are published. And stay tuned for more blog posts and videos about work we’re doing with open-source providers like Puppet, Chef, Cloud Foundry, Red Hat, SaltStack and others.

App Engine users, now you can configure custom domains from the API or CLI



As a developer, your job is to provide a professional branded experience for your users. If you’re developing web apps, that means you’ll need to host your application on its own custom domain accessed securely over HTTPS with an SSL certificate.

With App Engine, it’s always been easy to access applications from their own hostname, e.g., <YOUR_PROJECT_ID>.appspot.com, but custom domains and SSL certificates could only be configured through the App Engine component of the Cloud Platform Console.

Today, we’re happy to announce that you can now manage both your custom domains and SSL certificates using the new beta features of the Admin API and gcloud command-line tool. These new beta features provide improved management, including the ability to automate mapping domains and uploading SSL certificates.

We hope these new API and CLI commands will simplify managing App Engine applications, help your business scale, and ultimately, allow you to spend more time writing code.

Managing App Engine custom domains from the CLI


To get started with the CLI, first install the Google Cloud SDK.

To use the new beta commands, make sure you’ve installed the beta component:

gcloud components install beta

And if you’ve already installed that component, make sure that it's up to date:

gcloud components update

Now that you’ve installed the new beta command, verify your domain to register ownership:

gcloud beta domains verify <DOMAIN>
gcloud beta domains list-verified

After you've verified ownership, map that domain to your App Engine application:

gcloud beta app domain-mappings create <DOMAIN>

You can also map your subdomains this way. Note that as of today, only the verified owner can create mappings to a domain.

With the response from the last command, complete the mapping to your application by updating the DNS records of your domain.

To create an HTTPS connection, upload your SSL certificate:

gcloud beta app ssl-certificates create \
  --display-name <CERT_DISPLAY_NAME> \
  --certificate <CERT_DIRECTORY_PATH> \
  --private-key <KEY_DIRECTORY_PATH>

Then update your domain mapping to include the certificate that you just uploaded:

gcloud beta app domain-mappings update <DOMAIN> \
  --certificate-id <CERT_ID>

We're also excited to provide a single command that you can use to renew your certificate before it expires:

gcloud beta app ssl-certificates update <CERT_ID> \
  --certificate <CERT_DIRECTORY_PATH> \
  --private-key <KEY_DIRECTORY_PATH>
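If you'd rather script these operations against the Admin API directly, here's a rough sketch using the google-api-python-client discovery client (the field names follow the DomainMapping resource in the API reference; treat the project ID and domain as hypothetical placeholders and check the reference for exact request bodies):

from googleapiclient import discovery

# Builds a client for the App Engine Admin API from its discovery document;
# authentication uses Application Default Credentials.
appengine = discovery.build('appengine', 'v1beta')

project_id = 'your-project-id'   # hypothetical project
domain = 'www.example.com'       # a domain you've already verified

# List the domain mappings currently attached to the app.
mappings = appengine.apps().domainMappings().list(appsId=project_id).execute()
print(mappings.get('domainMappings', []))

# Map the custom domain to the app.
operation = appengine.apps().domainMappings().create(
    appsId=project_id,
    body={'id': domain},
).execute()
print(operation)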

As with all beta releases, these commands should not yet be used in production environments. For complete details, please check out the full set of instructions, along with the API reference. If you have any questions or feedback, we’ll be watching the Google App Engine forum; you can also log a public issue or get in touch on the App Engine Slack channel (#app-engine).

Solutions guide: Preparing Container Engine environments for production



Many Google Cloud Platform (GCP) users are now migrating production workloads to Container Engine, our managed Kubernetes environment. You can spin up a Container Engine cluster for development, then quickly start porting your applications. First and foremost, a production application must be resilient and fault tolerant, and deployed using Kubernetes best practices. You also need to prepare the Kubernetes environment for production by hardening it. As part of the migration to production, you may need to lock down who or what has access to your clusters and applications, from both an administrative and a network perspective.

We recently created a guide to help you with the push toward production on Container Engine. The guide walks through various patterns and features that allow you to lock down your Container Engine workloads. The first half focuses on how to control administrative access to the cluster using IAM and Kubernetes RBAC. The second half dives into network access patterns, teaching you how to properly configure your environment and Kubernetes services. With the IAM and networking models locked down appropriately, you can rest assured that you're ready to start directing your users to your new applications.

Read the full solution guide for using Container Engine for production workloads, or learn more about Container Engine from the documentation.

Getting started with Shared VPC



Large organizations with multiple cloud projects value the ability to share physical resources, while maintaining logical separation between groups or departments. At Google Cloud Next '17, we announced Shared VPC, which allows you to configure and centrally manage one or more virtual networks across multiple projects in your Organization, the top-level Cloud Identity and Access Management (Cloud IAM) resource in the Google Cloud Platform (GCP) resource hierarchy.

With Shared VPC, you can centrally manage the creation of routes, firewalls, subnet IP ranges, VPN connections, etc. for the entire organization, and at the same time allow developers to own billing, quotas, IAM permissions and autonomously operate their development projects. Shared VPC is now generally available, so let’s look at how it works and how best to configure it.

How does Shared VPC work?

We implemented Shared VPC entirely in the management control plane, transparent to the data plane of the virtual network. In the control plane, the centrally managed project is enabled as a host project, allowing it to contain one or more shared virtual networks. After configuring the necessary Cloud IAM permissions, you can then create virtual machines in shared virtual networks, by linking one or more service projects to the host project. The advantage of sharing virtual networks in this way is being able to control access to critical network resources such as firewalls and centrally manage them with less overhead.
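To make the host/service relationship concrete, here's a rough sketch of that flow using the Compute Engine API's XPN (Shared VPC) methods through the google-api-python-client discovery client; the project IDs are hypothetical and the request body is an approximation, so check the API reference before relying on it:

from googleapiclient import discovery

# Discovery-based client for the Compute Engine API; authentication uses
# Application Default Credentials.
compute = discovery.build('compute', 'beta')

host_project = 'shared-vpc-host'      # hypothetical host project
service_project = 'team-a-service'    # hypothetical service project

# Enable the centrally managed project as a Shared VPC (XPN) host.
compute.projects().enableXpnHost(project=host_project).execute()

# Attach a service project to the host so its VMs can use the shared network.
compute.projects().enableXpnResource(
    project=host_project,
    body={'xpnResource': {'id': service_project, 'type': 'PROJECT'}},
).execute()

# List the service projects currently attached to this host.
print(compute.projects().getXpnResources(project=host_project).execute())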

Further, with shared virtual networks, virtual machines benefit from the same network throughput caps and VM-to-VM latency as when they're not on shared networks. This is also the case for VM-to-VPN and load balancer-to-VM communication.

To illustrate, consider a single externally facing web application server that uses services such as personalization, recommendation and analytics, all internally available, but built by different development teams.

Example topology of a Shared VPC setup.

Let’s look at the recommended patterns when designing such a virtual network in your organization.

Shared VPC administrator role

The network administrator of the shared host project should also have the XPN administrator role in the organization. This allows a single central group to configure new service projects that attach to the shared VPC host project, while also allowing them to set up individual subnetworks in the shared network and configure IP ranges, for use by administrators of specific service projects. Typically, these administrators would have the InstanceAdmin role on the service project.

Subnetworks USE permission

When connecting a service project to the shared network, we recommend you grant the service project administrators the compute.subnetworks.use permission (through the NetworkUser role) on one (or more) subnetworks per region, such that each subnetwork is used by a single service project.

This will help ensure cleaner separation of usage of subnetworks by different teams in your organization. In the future, you may choose to associate specific network policies for each subnetwork based on which service project is using it.

Subnetwork IP ranges

When configuring subnetwork IP ranges in the same or different regions, allow sufficient IP space between subnetworks for future growth. GCP allows you to expand an existing subnetwork without affecting IP addresses owned by existing VMs in the virtual network and with zero downtime.

Shared VPC and folders

When using folders to manage projects created in your organization, place all host and service projects for a given Shared VPC setup within the same folder, so that the parent folder of the host project contains all the projects in the setup. When associating service projects with a host project, ensure that these projects will not move to other folders in the future while still being linked to the host project.


Control external access

In order to control and restrict which VMs can have public IPs and thus access to the internet, you can now set up an organization policy that disables external IP access for VMs. Do this only for projects that should have only internal access, e.g. the personalization, recommendation and analytics services in the example above.

As you can see, Shared VPC is a powerful tool that can make GCP more flexible and manageable for your organization. To learn more about Shared VPC, check out the documentation.